Advanced Mining Techniques

What is statistical analysis?

Statistical analysis, or statistics, involves collecting, organizing and analyzing data based on established
principles to identify patterns and trends. It is a broad discipline with applications in academia, business, the
social sciences, genetics, population studies, engineering and several other fields. Statistical analysis has
several functions. You can use it to make predictions, perform simulations, create models, reduce risk and
identify trends.

Thanks to improving technology, many organizations now have vast amounts of data on every aspect of their
operations and markets. To make sense of this data, businesses rely on statistical analysis techniques to
organize their data and turn this information into tools for making precise decisions and long-term forecasts.
Statistical analysis allows owners of data to perform business intelligence functions that solidify their
competitive advantage, improve efficiency and optimize resources for maximum returns on investments.

Main types of statistical analysis

There are three major types of statistical analysis:

Descriptive statistical analysis


Descriptive statistics is the simplest form of statistical analysis, using numbers to describe the qualities of a
data set. It helps reduce large data sets into simple and more compact forms for easy interpretation. You can
use descriptive statistics to summarize the data from a sample or represent a whole sample in a research
population. Descriptive statistics uses data visualization tools such as tables, graphs and charts to make
analysis and interpretation easier. However, descriptive statistics is not suitable for drawing conclusions on its own; it can only summarize the data so that more sophisticated statistical analysis tools can be applied to draw inferences.
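
For instance, a minimal sketch in Python (assuming the pandas library and a made-up set of exam scores) of how descriptive statistics reduce a data set to a few summary numbers:

# Minimal descriptive-statistics sketch using pandas on hypothetical exam scores.
import pandas as pd

scores = pd.Series([55, 61, 68, 72, 72, 75, 80, 84, 90, 95], name="exam_score")

print(scores.describe())             # count, mean, std, min, quartiles, max
print("Median:", scores.median())
print("Mode:", scores.mode().tolist())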

Inferential statistical analysis

Inferential statistical analysis is used to make inferences or draw conclusions about a larger population based
on findings from a sample group within it. It can help researchers to find distinctions among groups present
within a sample. Inferential statistics is also used to validate generalizations made about a population from a
sample due to its ability to account for errors in conclusions made about a segment of a larger group.
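
As a hedged illustration, the sketch below (assuming SciPy and two hypothetical sample groups) applies a two-sample t-test, one common inferential technique, to judge whether an observed difference between group means is likely to hold in the wider population:

# Inferential-statistics sketch: two-sample t-test with SciPy on hypothetical samples.
from scipy import stats

group_a = [23, 25, 28, 30, 31, 27, 26]   # hypothetical measurements from sample group A
group_b = [20, 22, 24, 21, 23, 25, 22]   # hypothetical measurements from sample group B

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A small p-value suggests the difference between the group means is unlikely to be due to chance alone.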

Associational statistical analysis

Associational statistics is a tool researchers use to make predictions and find causation. They use it to find
relationships among multiple variables. It is also used to determine whether researchers can make inferences
and predictions about a data set from the characteristics of another set of data. Associational statistics is the
most advanced type of statistical analysis and requires sophisticated software tools for performing high-level
mathematical calculations. To measure association, researchers use a range of measures, including correlation coefficients and regression analysis.
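
A minimal sketch of associational analysis in Python (assuming SciPy and hypothetical advertising and sales figures), computing a correlation coefficient and fitting a simple linear regression:

# Associational-statistics sketch: Pearson correlation and simple linear regression with SciPy.
from scipy import stats

advertising = [10, 15, 20, 25, 30, 35]   # hypothetical advertising spend
sales       = [25, 31, 38, 44, 52, 57]   # hypothetical sales figures

r, p = stats.pearsonr(advertising, sales)
print(f"Pearson r = {r:.3f} (p = {p:.4f})")

slope, intercept, r_value, p_value, std_err = stats.linregress(advertising, sales)
print(f"sales ≈ {slope:.2f} * advertising + {intercept:.2f}")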

Association Rule
Association rule mining finds interesting associations and relationships among large sets of data items. This rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.

Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between items that people frequently buy together. Association rule learning can be divided into three types of algorithms (the first of which is sketched after this list):

1. Apriori

2. Eclat

3. F-P Growth Algorithm
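
As a rough illustration of the first algorithm on this list, the sketch below assumes the third-party mlxtend library and a small made-up set of market-basket transactions; it mines frequent itemsets with Apriori and then derives association rules from them:

# Apriori sketch using the mlxtend library (assumed installed) on made-up transactions.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])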

How does Association Rule Learning work?

Association rule learning works on the concept of an If-Then statement, such as "if A then B".

Here the If element is called the antecedent, and the Then element is called the consequent. A relationship in which an association is found between two items is known as single cardinality; as the number of items in a rule increases, the cardinality increases accordingly. So, to measure the associations among thousands of data items, several metrics are used. These metrics are given below:

● Support

● Confidence

● Lift

Support

Support is the frequency of A, or how frequently an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For a set of transactions T, it can be written as:

Support(X) = (Number of transactions containing X) / (Total number of transactions T)

Confidence

Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset given that X already occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

Lift

Lift measures the strength of a rule, that is, how much more often X and Y occur together than would be expected if they were independent. It can be defined by the formula below:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y)) = Confidence(X → Y) / Support(Y)
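
A minimal plain-Python sketch (with made-up transactions and the hypothetical rule {bread} → {milk}) showing how support, confidence and lift follow from the definitions above:

# Support, confidence and lift for the rule {bread} -> {milk} on made-up transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / n

X, Y = {"bread"}, {"milk"}
supp_xy = support(X | Y)            # Support(X ∪ Y)
conf = supp_xy / support(X)         # Confidence(X → Y)
lift = conf / support(Y)            # Lift(X → Y)

print(f"support={supp_xy:.2f}, confidence={conf:.2f}, lift={lift:.2f}")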

What is Cluster Analysis?

Cluster analysis is a multivariate data mining technique whose goal is to group objects (e.g., products, respondents, or other entities) based on a set of user-selected characteristics or attributes. It is a basic and important step of data mining and a common technique for statistical data analysis, used in many fields such as data compression, machine learning, pattern recognition and information retrieval.

Types of Cluster Analysis

The clustering algorithm needs to be chosen experimentally unless there is a mathematical reason to choose one clustering method over another. It should be noted that an algorithm that works well on a particular set of data may not work well on another. There are a number of different methods of performing cluster analysis.

Hierarchical Cluster Analysis


In this method, a cluster is first formed and then merged with the most similar and closest cluster to form a single larger cluster. This process is repeated until all objects are in one cluster. This particular approach is known as the agglomerative method: agglomerative clustering starts with single objects and progressively groups them into clusters.

The divisive method is the other kind of hierarchical method, in which clustering starts with the complete data set and then divides it into partitions.
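
A minimal sketch of agglomerative hierarchical clustering (assuming SciPy and a few made-up 2-D points): the merge tree is built bottom-up and then cut into a chosen number of clusters:

# Agglomerative hierarchical clustering sketch with SciPy on made-up 2-D points.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.2], [1.1, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 9.1]])

Z = linkage(points, method="ward")                 # build the dendrogram (merge tree)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into at most 3 clusters
print(labels)                                      # cluster label for each point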

Centroid-based Clustering

In this type of clustering, clusters are represented by a central entity, which may or may not be a part of the given data set. The K-Means method is the typical example, where k is the number of cluster centres and objects are assigned to the nearest cluster centre.
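
A brief K-Means sketch (assuming scikit-learn and made-up 2-D points), where each object is assigned to the nearest of k cluster centres:

# Centroid-based clustering sketch with scikit-learn's KMeans on made-up 2-D points.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # the two learned centroids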

Distribution-based Clustering

It is a type of clustering model closely related to statistics, based on models of distribution. Objects that belong to the same distribution are put into a single cluster. This type of clustering can capture some complex properties of objects, such as correlation and dependence between attributes.
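
A short sketch of distribution-based clustering with a Gaussian mixture model (assuming scikit-learn and made-up 2-D points); each cluster corresponds to one fitted Gaussian component:

# Distribution-based clustering sketch: Gaussian mixture model with scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

points = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(points)
print(gmm.predict(points))   # cluster assignment for each point
print(gmm.means_)            # estimated mean of each Gaussian component
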
Density-based Clustering

In this type of clustering, clusters are defined as areas of higher density than the rest of the data set. Objects in sparse areas are used to separate clusters; the objects in these sparse regions are usually treated as noise or border points. The most popular method of this type is DBSCAN.
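
A minimal DBSCAN sketch (assuming scikit-learn and made-up 2-D points); dense regions become clusters and points in sparse areas are labelled -1 as noise:

# Density-based clustering sketch with DBSCAN (scikit-learn) on made-up 2-D points.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
                   [5.0, 5.0], [5.1, 4.9],
                   [25.0, 25.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(points)
print(db.labels_)   # e.g. [0 0 0 1 1 -1]; -1 marks a noise point in a sparse area
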
Cluster Analysis
Cluster analysis is the process of finding similar groups of objects in order to form clusters. It is an unsupervised machine-learning technique that acts on unlabelled data. Data points are grouped together to form a cluster in which all the objects belong to the same group.

Cluster:
The given data is divided into different groups by combining similar objects into a group. Such a group is called a cluster: a collection of similar data items grouped together.

Properties of Clustering:
1. Clustering Scalability: Nowadays there is a vast amount of data, and clustering often has to deal with huge databases. In order to handle extensive databases, the clustering algorithm should be scalable; if it is not, we cannot get appropriate results and the analysis may lead to wrong conclusions.

2. High Dimensionality: The algorithm should be able to handle data in high-dimensional space as well as data sets of small size.

3. Algorithm Usability with multiple data kinds: Different kinds of data can be used with clustering algorithms. The algorithm should be capable of dealing with different types of data, such as discrete, categorical, interval-based and binary data.

4. Dealing with unstructured data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may produce poor-quality clusters. The algorithm should therefore be able to handle unstructured data and give it some structure by organizing it into groups of similar data objects. This makes the job of the data expert easier when processing the data and discovering new patterns.

5. Interpretability: The outcomes of clustering should be interpretable, comprehensible, and usable. The
interpretability reflects how easily the data is understood.

Clustering Methods:
The clustering methods can be classified into the following categories:

● Partitioning Method
● Hierarchical Method
● Density-based Method
● Grid-Based Method
● Model-Based Method
● Constraint-based Method

Applications of Cluster Analysis

● Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.
● Clustering can also help marketers discover distinct groups in their customer base and characterize those groups based on their purchasing patterns.
● In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionalities and gain insight into structures inherent to populations.
● Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house type,
value, and geographic location.
● Clustering also helps in classifying documents on the web for information discovery.

● Clustering is also used in outlier detection applications such as detection of credit card fraud.

● As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining −

● Scalability − We need highly scalable clustering algorithms to deal with large databases.
● Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical and binary data.
● Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bound only to distance measures that tend to find small spherical clusters.
● High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional space.
● Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor-quality clusters.
● Interpretability − The clustering results should be interpretable, comprehensible, and usable.
