DATA MINING MODULE 2

Data mining is the process of extracting useful information from large datasets to identify patterns and trends that aid in decision-making. It encompasses functionalities such as data characterization, discrimination, association analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. Data pre-processing is crucial for improving data quality and involves steps such as cleaning, integration, transformation, reduction, and discretization to prepare data for effective mining.

Data Mining

Data Mining is the process of extracting information from huge sets of data in order to identify patterns, trends, and useful knowledge that allow a business to take data-driven decisions.
In other words, Data Mining is the process of investigating hidden patterns in data from various perspectives and turning them into useful information. The data is collected and assembled in repositories such as data warehouses, analysed efficiently with data mining algorithms, and used to support decision making, cut costs, and generate revenue.
Data mining is the act of automatically searching large stores of information for trends and patterns that go beyond simple analysis procedures. It uses sophisticated mathematical algorithms to segment the data and to evaluate the probability of future events. Data Mining is also called Knowledge Discovery in Databases (KDD).
Data Mining is a process used by organizations to extract specific information from huge databases to solve business problems. It primarily turns raw data into useful information.
Functionalities of data mining
Data mining functionalities specify the kinds of patterns to be discovered in data mining tasks. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.
There are various data mining functionalities which are as follows −

Data characterization − It is a summarization of the general characteristics or features of a target class of data.


The data corresponding to the user-specified class is generally collected by a database query. The output
of data characterization can be presented in multiple forms.

Data discrimination − It is a comparison of the general characteristics of target class data objects with the general characteristics of objects from one or a set of contrasting classes. The target and contrasting classes are specified by the user, and the corresponding data objects are fetched through database queries.

Association Analysis − It analyses the sets of items that generally occur together in a transactional dataset. Two parameters are used for determining the association rules −
Support, which identifies how frequently an itemset appears in the database.
Confidence, which is the conditional probability that an item occurs in a transaction given that another item occurs.

Classification − Classification is the procedure of discovering a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).

Prediction − It is used to predict missing or unavailable data values or upcoming trends. An object can be anticipated based on the attribute values of the object and the attribute values of the classes. It may be the prediction of missing numerical values or of increase/decrease trends in time-related data.

Clustering − It is similar to classification, but the classes are not predefined; the groups are derived from the data attributes. It is a form of unsupervised learning. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.

Outlier analysis − Outliers are data elements that cannot be grouped into any given class or cluster. They are the data objects whose behaviour deviates from the general behaviour of the other data objects. The analysis of this kind of data can be essential for mining knowledge, for example in fraud detection.

Evolution analysis − It describes and models regularities or trends for objects whose behaviour changes over time.
Major Issues in Data Mining
Efficiency and scalability of data mining algorithms − To effectively extract information from a large amount of data in databases, knowledge discovery algorithms must be efficient and scalable to huge databases. In particular, the running time of a data mining algorithm should be predictable and acceptable on huge databases. Algorithms with exponential or even high-order polynomial complexity are not of practical use.

Usefulness, certainty, and expressiveness of data mining results − The discovered knowledge should accurately portray the contents of the database and be beneficial for specific applications. Imperfection should be expressed by measures of uncertainty, in the form of approximate rules or quantitative rules.

Noise and exceptional data must be managed elegantly in data mining systems. This also stimulates
a systematic study of measuring the quality of the discovered knowledge, such as interestingness and
reliability, by the development of statistical, analytical, and simulative models and tools.

Expression of various kinds of data mining results − Several kinds of knowledge can be discovered from a huge amount of data. Users may also wish to examine discovered knowledge from multiple views and have it displayed in different forms.
This requires us to express both the data mining requests and the discovered knowledge in high-level languages or graphical user interfaces, so that data mining tasks can be specified by non-experts and the discovered knowledge is understandable and directly usable by users. This also requires the discovery system to adopt expressive knowledge representation techniques.

Interactive mining of knowledge at multiple abstraction levels − Because it is difficult to predict exactly what will be discovered from a database, a high-level data mining query should be treated as a probe that may disclose some interesting traces for further exploration.
Interactive discovery should be encouraged, allowing a user to interactively refine a data mining request, dynamically change the data focus, progressively deepen a data mining process, and flexibly view the data and the mining results at several abstraction levels and from multiple angles.

Mining information from different sources of data − Widely available local- and wide-area computer networks, such as the Internet, connect many sources of data and form huge distributed, heterogeneous databases. Mining knowledge from multiple sources of formatted or unformatted data with diverse data semantics poses new challenges to data mining.
On the other hand, data mining can help disclose high-level data regularities in heterogeneous databases that can hardly be discovered by simple query systems. Furthermore, the huge size of these databases, the wide distribution of data, and the computational complexity of some data mining methods motivate the development of parallel and distributed data mining algorithms.
Data Pre-processing in Data Mining
Data pre-processing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data pre-
processing is to improve the quality of the data and to make it more suitable for the specific data mining
task.
Some common steps in data pre-processing include:

Data cleaning: this step involves identifying and handling missing, inconsistent, or irrelevant data. This can include removing duplicate records, filling in missing values, and handling outliers.
Data integration: this step involves combining data from multiple sources, such as databases,
spreadsheets, and text files. The goal of integration is to create a single, consistent view of the data.
Data transformation: this step involves converting the data into a format that is more suitable for the
data mining task. This can include normalizing numerical data, creating dummy variables, and encoding
categorical data.
Data reduction: this step is used to select a subset of the data that is relevant to the data mining task.
This can include feature selection (selecting a subset of the variables) or feature extraction (extracting
new variables from the data).
Data discretization: this step is used to convert continuous numerical data into categorical data, which can be used for decision trees and other categorical data mining techniques.

By performing these steps, the data mining process becomes more efficient and the results become more
accurate.

Pre-processing in Data Mining:


Data pre-processing is a data mining technique which is used to transform the raw data into a useful and efficient format.
Steps Involved in Data Pre-processing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle these, data cleaning is performed. It involves handling missing data, noisy data, etc.

(a). Missing Data:


This situation arises when some values are missing from the data. It can be handled in various ways. Some of them are:
Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
Fill the missing values: There are various ways to do this task. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value.
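As a minimal sketch of these two options (assuming pandas is available; the 'income' values are made up for illustration):

import pandas as pd

# Hypothetical attribute with missing values.
df = pd.DataFrame({"income": [54000, None, 73600, 61000, None]})

# Ignore the tuples: drop the rows that contain missing values.
dropped = df.dropna()

# Fill the missing values: here with the mean of the observed incomes.
filled = df.fillna({"income": df["income"].mean()})
print(filled)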

(b). Noisy Data:


Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
Binning method:
The binning method is used to smooth data and to handle noisy data. In this method, the data is first sorted, and then the sorted values are distributed into a number of buckets or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing. There are three approaches to performing smoothing –
Smoothing by bin means: In smoothing by bin means, each value in a bin is replaced by the mean value
of the bin.
Smoothing by bin median: In this method each bin value is replaced by its bin median value.
Smoothing by bin boundary: In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary
value.
Approach:

Sort the values of the given attribute.

Divide the range into N intervals, each containing approximately the same number of samples (equal-depth partitioning).
Replace the values in each bin with the bin mean, median, or boundary values.
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition using equal frequency approach:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Smoothing by bin median:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
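The worked example above can be reproduced with a short plain-Python sketch; the bin size of 4 is taken from the example, and values ending in .5 are rounded upward to match the hand-computed means and medians.

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def round_half_up(x):
    return int(x + 0.5)  # round .5 upward, matching the hand-worked example

def equal_depth_bins(values, bin_size):
    values = sorted(values)
    return [values[i:i + bin_size] for i in range(0, len(values), bin_size)]

def smooth_by_means(bins):
    return [[round_half_up(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_medians(bins):
    def median(b):
        mid = len(b) // 2
        return round_half_up((b[mid - 1] + b[mid]) / 2) if len(b) % 2 == 0 else b[mid]
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value with whichever of the bin's min or max is closer.
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

bins = equal_depth_bins(prices, 4)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_medians(bins))     # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]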
Regression:
Here the data is smoothed by fitting it to a regression function. The regression may be linear or multiple. Linear regression involves only one independent variable, while multiple regression involves more than one independent variable.
Simple Linear Regression:
Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear (a sloped straight line), hence the name Simple Linear Regression.

It is represented by the equation:


Y = a + b*X + e,
where a is the intercept, b is the slope of the regression line, and e is the error term. X and Y are the predictor and target variables, respectively. When X is made up of more than one variable (or feature), it is termed multiple linear regression.

The best-fit line is obtained using the least-squares method. This method minimizes the sum of the squared deviations of the data points from the regression line.
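A minimal least-squares fit can be sketched in a few lines of Python; the data points below are made up for illustration, and numpy is assumed to be available.

import numpy as np

# Hypothetical observations of a predictor X and a target Y.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])

# Closed-form least-squares estimates for Y = a + b*X.
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print("prediction at X = 6:", a + b * 6)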

Multiple linear regression:


Multiple linear regression refers to a statistical technique that uses two or more independent variables
to predict the outcome of a dependent variable. The technique enables analysts to determine the
variation of the model and the relative contribution of each independent variable in the total variance.

Clustering:
This method operates on groups of values: similar values are organized into a "group" or "cluster", and values that fall outside of all clusters can be treated as noise. In this way, clustering can also be used to detect outliers.

Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:
Data Normalization:
It is done in order to scale the data values into a specified range (such as -1.0 to 1.0 or 0.0 to 1.0). There are different methods to normalize the data, as discussed below. Consider a numeric attribute A with n observed values V1, V2, V3, ..., Vn.

Min-max normalization:

This method applies a linear transformation to the original data. Let minA and maxA be the minimum and maximum values observed for attribute A, and let Vi be a value of attribute A that has to be normalized. Min-max normalization maps Vi to V'i in a new, smaller range [new_minA, new_maxA].
The formula for min-max normalization is given below:

V'i = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

For example, suppose $12,000 and $98,000 are the minimum and maximum values for the attribute income, and [0.0, 1.0] is the range in which we have to map the value $73,600. The value $73,600 is transformed using min-max normalization as follows:

(73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716
Z-score normalization:

This method normalizes the values of attribute A using the mean and standard deviation of A. The following formula is used for Z-score normalization:

V'i = (Vi - Ā) / σA

Here Ā and σA are the mean and standard deviation of attribute A, respectively.
For example, suppose the mean and standard deviation of the attribute income are $54,000 and $16,000, and we have to normalize the value $73,600 using z-score normalization:

(73,600 - 54,000) / 16,000 = 1.225

Decimal Scaling:

This method normalizes the values of attribute A by moving the decimal point. How far the decimal point is moved depends on the maximum absolute value of A. The formula for decimal scaling is given below:

V'i = Vi / 10^j

Here j is the smallest integer such that max(|V'i|) < 1. For example, suppose the observed values for attribute A range from -986 to 917, so the maximum absolute value for attribute A is 986. To normalize each value of attribute A using decimal scaling, we divide each value by 1000, i.e., j = 3. So the value -986 would be normalized to -0.986, and 917 would be normalized to 0.917.
The normalization parameters, such as the mean, the standard deviation, and the maximum absolute value, must be preserved so that future data can be normalized uniformly.
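The three normalization formulas can be written directly in Python; the sketch below reuses the income figures from the examples above and is only an illustration.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(values):
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:  # smallest j such that max(|V'i|) < 1
        j += 1
    return [v / 10 ** j for v in values]

print(min_max(73600, 12000, 98000))   # 0.716...
print(z_score(73600, 54000, 16000))   # 1.225
print(decimal_scaling([-986, 917]))   # [-0.986, 0.917]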

Attribute Selection:
In the attribute construction method, new attributes are constructed from the existing attributes to produce a data set that eases data mining. The new attributes are created from the given attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.
For example, suppose we have a data set containing measurements of different plots, i.e., the height and width of each plot. We can then construct a new attribute 'area' from the attributes 'height' and 'width'. This also helps in understanding the relations among the attributes in a data set.

Discretization:

Data discretization refers to a method of converting a large number of data values into a smaller number of values so that the evaluation and management of the data become easier. In other words, data discretization is a method of converting the values of a continuous attribute into a finite set of intervals with minimum loss of information.
There are two forms of data discretization: supervised discretization and unsupervised discretization.
Supervised discretization refers to a method in which the class labels are used.
Unsupervised discretization refers to a method that depends only on the way the operation proceeds, i.e., whether it works with a top-down splitting strategy or a bottom-up merging strategy.
Suppose we have an attribute Age with the given values.

Table before Discretization
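Since the table is not reproduced here, the sketch below uses hypothetical Age values and assumed interval boundaries and labels, purely to illustrate what discretization of a continuous attribute looks like.

# Hypothetical raw Age values (the original table is not shown above).
ages = [12, 17, 23, 35, 41, 52, 58, 64, 70]

# Assumed interval boundaries and labels for the discretized attribute.
bins = [(0, 18, "Child"), (18, 40, "Young"), (40, 60, "Middle-aged"), (60, 120, "Senior")]

def discretize(age):
    for low, high, label in bins:
        if low <= age < high:
            return label
    return "Unknown"

print([(a, discretize(a)) for a in ages])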

Concept Hierarchy Generation:

The term hierarchy represents an organizational structure or mapping in which items are ranked according to their level of generality. In other words, a concept hierarchy refers to a sequence of mappings from a set of low-level (specific) concepts to higher-level, more general concepts. For example, in computer science there are different types of hierarchical systems: a document placed in a folder at a specific position in the Windows directory tree is a good example of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.
Let's understand concept hierarchies for the dimension location with the help of an example.
A particular city can be mapped to the country to which it belongs. For example, New Delhi can be mapped to India, and India can be mapped to Asia.

Top-down mapping
Top-down mapping starts at the top with general information and ends at the bottom with specialized information.

Bottom-up mapping
Bottom-up mapping starts at the bottom with specialized information and ends at the top with generalized information.
Data Reduction:
1. Dimensionality Reduction
2. Numerosity Reduction
3. Discretization Operation

1. Dimensionality Reduction
Whenever we encounter weakly relevant or redundant data, we keep only the attributes required for our analysis. Dimensionality reduction eliminates the other attributes from the data set under consideration, thereby reducing the volume of the original data. It reduces the data size by eliminating outdated or redundant features.
Here are three methods of dimensionality reduction.

• Wavelet Transform: In the wavelet transform, a data vector A is transformed into a numerically different data vector A' such that both A and A' are of the same length. This is useful for data reduction because the data obtained from the wavelet transform can be truncated: a compressed approximation is obtained by retaining only a small fraction of the strongest wavelet coefficients. The wavelet transform can be applied to data cubes, sparse data, or skewed data.

A worked example of the (Haar-style) wavelet transform on an 8-value data vector: in each pass, every pair of values is replaced by its average, and the difference between the first value of the pair and that average is kept as a detail coefficient:

AVG(i) = (A[2i] + A[2i+1]) / 2,  detail(i) = A[2i] - AVG(i)

Original data:        56  40   8  24  48  48  40  16
After pass 1:         48  16  48  28 |  8  -8   0  12    (e.g., (56 + 40)/2 = 48 and 56 - 48 = 8)
After pass 2:         32  38 | 16  10 |  8  -8   0  12
After pass 3 (final): 35  -3 | 16  10 |  8  -8   0  12

The transformed vector (35, -3, 16, 10, 8, -8, 0, 12) can be truncated by dropping the smallest detail coefficients, and the remaining coefficients still allow an approximate reconstruction of the original data. A code sketch after this list reproduces these passes.
• Principal Component Analysis: Suppose we have a data set to be analysed whose tuples have n attributes. Principal component analysis searches for k orthogonal vectors (the principal components, with k ≤ n) that can best represent the data. In this way, the original data can be projected onto a much smaller space, and dimensionality reduction is achieved. Principal component analysis can be applied to sparse and skewed data.

• Attribute Subset Selection: A large data set has many attributes, some of which are irrelevant to data mining and some of which are redundant. Attribute subset selection reduces the data volume and dimensionality by eliminating such redundant and irrelevant attributes. It ensures that we get a good subset of the original attributes even after eliminating the unwanted ones, so that the resulting probability distribution of the data is as close as possible to the original distribution obtained using all the attributes.
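Below is a minimal plain-Python sketch of the averaging/differencing passes used in the wavelet-transform example above; it assumes the input length is a power of two and is meant only to illustrate why the transformed vector is easy to truncate.

def haar_transform(data):
    # Repeatedly replace pairs by (average, detail) until a single average remains.
    output = list(data)
    length = len(output)
    while length > 1:
        averages = [(output[2 * i] + output[2 * i + 1]) / 2 for i in range(length // 2)]
        details = [output[2 * i] - averages[i] for i in range(length // 2)]
        output[:length] = averages + details  # details from previous passes (beyond 'length') are untouched
        length //= 2
    return output

print(haar_transform([56, 40, 8, 24, 48, 48, 40, 16]))
# [35.0, -3.0, 16.0, 10.0, 8.0, -8.0, 0.0, 12.0]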

2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents the data in a much smaller form. This technique includes two types: parametric and non-parametric numerosity reduction.

Parametric: Parametric numerosity reduction stores only the parameters of a model of the data instead of the original data. Regression and log-linear models are examples of parametric numerosity reduction.

• Regression and Log-Linear: Linear regression models the relationship between two attributes by fitting a linear equation to the data set. Suppose we need to model a linear function between two attributes:

y = w*x + b

Here, y is the response attribute and x is the predictor attribute. In data mining terms, attributes x and y are numeric database attributes, whereas w and b are regression coefficients. Multiple linear regression lets the response variable y be modelled as a linear function of two or more predictor variables. A log-linear model discovers the relationship between two or more discrete attributes in the database. Suppose we have a set of tuples presented in n-dimensional space; the log-linear model can then be used to estimate the probability of each tuple in this multidimensional space. Regression and log-linear methods can be used for sparse data and skewed data.

Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. Non-parametric techniques give a more uniform reduction, irrespective of the data, but they may not achieve as high a volume of reduction as the parametric techniques. The main non-parametric data reduction techniques are histograms, clustering, sampling, data cube aggregation, and data compression.
• Histogram: A histogram is a graph that represents a frequency distribution, i.e., it describes how often each value appears in the data. A histogram uses the binning method to represent an attribute's data distribution, using disjoint subsets called bins or buckets. A histogram can represent dense, sparse, uniform, or skewed data. Instead of only one attribute, a histogram can also be built over multiple attributes; it can effectively represent up to five attributes.

• Clustering: Clustering techniques group similar objects from the data so that the objects in a cluster are similar to each other but dissimilar to the objects in other clusters. How similar the objects inside a cluster are can be calculated with a distance function: the more similar the objects in a cluster, the closer they appear within the cluster. The quality of a cluster depends on its diameter, i.e., the maximum distance between any two objects in the cluster. The cluster representations then replace the original data. This technique is more effective if the data can be grouped into distinct clusters.

• Sampling: Sampling is one of the methods used for data reduction, as it can reduce a large data set to a much smaller data sample. Below we discuss the different ways in which we can sample a large data set D containing N tuples:

1. Simple random sample without replacement (SRSWOR) of size s: Here s tuples (s < N) are drawn from the N tuples of data set D such that the probability of drawing any tuple of D is 1/N, i.e., all tuples have an equal probability of being sampled, and a drawn tuple is not returned to D.

2. Simple random sample with replacement (SRSWR) of size s: It is similar to SRSWOR, except that each tuple drawn from data set D is recorded and then placed back into D so that it can be drawn again.

3. Cluster sample: The tuples in data set D are grouped into M mutually disjoint subsets (clusters). Data reduction can then be applied by taking a simple random sample of s of these clusters (s < M), for example using SRSWOR on the clusters.

4. Stratified sample: The large data set D is partitioned into mutually disjoint sets called 'strata', and a simple random sample is taken from each stratum to obtain the stratified sample. This method is effective for skewed data.

• Data Cube Aggregation

This technique is used to aggregate data into a simpler form. Data cube aggregation is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.
For example, suppose you have the quarterly sales data of All Electronics for the years 2018 to 2022. If you want the annual sales per year, you only have to aggregate the sales per quarter of each year. In this way, aggregation provides you with the required data, which is much smaller in size, and data reduction is achieved without losing the information needed for the analysis. The data cube presents precomputed, summarized data, which eases multidimensional analysis and gives data mining fast access to the data.

• Data Compression

Data compression employs modification, encoding, or conversion of the structure of the data in a way that consumes less space. It builds a compact representation of the information by removing redundancy and representing the data in binary form. Data that can be restored exactly from its compressed form is said to use lossless compression; in contrast, when it is not possible to restore the original form from the compressed form, the compression is lossy. Dimensionality reduction and numerosity reduction methods are also used for data compression.

This technique reduces the size of files using different encoding mechanisms, such as Huffman encoding and run-length encoding. Based on the compression technique, we can divide it into two types:
i. Lossless compression: Encoding techniques (such as run-length encoding) allow a simple and modest reduction of the data size. Lossless data compression uses algorithms to restore the precise original data from the compressed data.
ii. Lossy compression: In lossy data compression, the decompressed data may differ from the original data but is still useful enough to retrieve the information contained in it. For example, the JPEG image format uses lossy compression, but we can still obtain an image whose meaning is equivalent to the original. Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this kind of compression.
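As a small illustration of the lossless case, the sketch below implements run-length encoding, one of the encoding mechanisms mentioned above; the input string is made up, and the original data is restored exactly from the (value, count) pairs.

from itertools import groupby

def rle_encode(data):
    # Replace each run of equal values with a (value, run_length) pair.
    return [(value, len(list(group))) for value, group in groupby(data)]

def rle_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]

original = "AAAABBBCCD"
encoded = rle_encode(original)
print(encoded)                                   # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print("".join(rle_decode(encoded)) == original)  # True (lossless)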

3. Discretization Operation
The data discretization technique is used to divide the values of a continuous attribute into intervals. We replace the many individual values of the attribute with labels for a small number of intervals, so that the mining results can be presented in a concise and easily understandable way.
Top-down discretization: If we first pick one or a few points (so-called breakpoints or split points) to divide the whole range of attribute values and then repeat this on the resulting intervals, the process is known as top-down discretization, also called splitting.

Bottom-up discretization: If we first consider all the distinct values as potential split points and then discard some of them by merging neighbourhood values into intervals, the process is called bottom-up discretization, also called merging.

Association Rule
Association rule mining finds interesting associations and relationships among large sets of data items. Such a rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to discover associations between items. It allows retailers to identify relationships between the items that people frequently buy together.
Given a set of transactions, we can find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

How does Association Rule Learning work?

Association rule learning works on the concept of if-then statements, such as "if A then B".

Here the "if" element is called the antecedent, and the "then" part is called the consequent. A relationship in which an association is found between two single items is said to have single cardinality; as the number of items in a rule increases, the cardinality increases accordingly. To measure the associations between thousands of data items, several metrics are used. These metrics are given below:
• Support
• Confidence
• Lift
Support says how popular an itemset is, measured as the proportion of transactions in which the itemset appears.
Confidence says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}. It is measured as the proportion of transactions containing item X in which item Y also appears. Confidence alone can misrepresent the importance of an association, because it does not account for how popular Y is on its own.

Lift says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is.

A customer makes 4 transactions with you. In the first transaction, she buys 1 apple, 1 beer, 1 rice, and 1 chicken. In the second transaction, she buys 1 apple, 1 beer, and 1 rice. In the third transaction, she buys 1 apple and 1 beer only. In the fourth transaction, she buys 1 apple and 1 orange.
Support (Apple) = 4/4, so the Support of {Apple} is 4 out of 4, or 100%.
Confidence (Apple -> Beer) = Support (Apple, Beer) / Support (Apple)
= (3/4) / (4/4)
= 3/4
So the Confidence of {Apple -> Beer} is 3 out of 4, or 75%.
Lift (Beer -> Rice) = Support (Beer, Rice) / (Support (Beer) * Support (Rice))
= (2/4) / ((3/4) * (2/4))
= 1.33
A Lift value greater than 1 implies that Rice is likely to be bought if Beer is bought.
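The sketch below recomputes the support, confidence, and lift values of this worked example in plain Python; the four transaction sets mirror the purchases described above.

transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "orange"},
]

def support(*items):
    return sum(1 for t in transactions if set(items) <= t) / len(transactions)

def confidence(x, y):
    return support(x, y) / support(x)

def lift(x, y):
    return support(x, y) / (support(x) * support(y))

print(support("apple"))             # 1.0  -> 100%
print(confidence("apple", "beer"))  # 0.75 -> 75%
print(lift("beer", "rice"))         # 1.33...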

Types of Association Rule Mining Algorithms

Horizontal vs Vertical Data Format

Existing association rule mining algorithms can be broadly divided into two main categories: horizontal-format mining algorithms and vertical-format mining algorithms. The matrix that records which items occur in which transactions can be represented in either a horizontal or a vertical way.

The most commonly used layout is the horizontal data layout: each transaction has a transaction identifier (TID) and a list of the items occurring in that transaction, i.e., {TID : itemset}. Another commonly used layout is the vertical data layout, in which the database consists of a set of items, each followed by the set of identifiers of the transactions containing that item, i.e., {item : TID_set}. Table 1 shows the horizontal layout and Table 2 shows the vertical layout:
Horizontal

Vertical
The Apriori algorithm uses the horizontal format, while Eclat can be used only on vertical-format data sets.
Apriori algorithm (Horizontal)
The Apriori algorithm uses frequent itemsets to generate association rules. It is based on the concept that any subset of a frequent itemset must also be a frequent itemset. A frequent itemset is an itemset whose support value is greater than or equal to a threshold value (the minimum support).

Steps for Apriori Algorithm

Step-1: Determine the support of the itemsets in the transactional database, and select the minimum support and minimum confidence.
Step-2: Keep all the itemsets whose support value in the transactions is higher than the minimum (selected) support value.
Step-3: Find all the rules over these subsets that have a confidence value higher than the threshold (minimum confidence).
Step-4: Sort the rules in decreasing order of lift.

Steps in detail
Step-1: K=1
• Create a table containing the support count of each item present in the dataset – called C1 (the candidate set).
• Compare each candidate item's support count with the minimum support count. This gives us the itemset L1.

Step-2: K=2

• Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common.
• Check whether all subsets of each candidate itemset are frequent; if any subset is not frequent, remove that candidate.
• Now find the support count of these candidate itemsets by searching the dataset.
• Compare the candidate (C2) support counts with the minimum support count; this gives us the itemset L2.

Continue this process until no further frequent itemsets are found.

Step-1: Calculating C1 and L1:


Step-2: Candidate Generation C2, and L2:

Step-3: Candidate generation C3, and L3:

Step-4: Finding the association rules for the subsets:
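Since the C1/L1 and C2/L2 tables are not reproduced here, the following pure-Python sketch illustrates the candidate-generation loop on a made-up transaction set with a minimum support count of 2; rule generation and sorting by lift are omitted for brevity.

from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
min_support_count = 2

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# C1 -> L1: frequent 1-itemsets.
items = sorted({item for t in transactions for item in t})
current_L = [frozenset([i]) for i in items if support_count(frozenset([i])) >= min_support_count]

frequent = list(current_L)
k = 2
while current_L:
    # Join step: unite pairs of (k-1)-itemsets that differ in one item, then prune by support.
    candidates = {a | b for a, b in combinations(current_L, 2) if len(a | b) == k}
    current_L = [c for c in candidates if support_count(c) >= min_support_count]
    frequent.extend(current_L)
    k += 1

for itemset in frequent:
    print(sorted(itemset), support_count(itemset))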


ECLAT Algorithm
The Eclat algorithm is a data mining algorithm used to find frequent itemsets. As we already know, some association rule mining algorithms use a horizontal data format and some use a vertical data format for the generation of frequent itemsets. Eclat cannot use a horizontal database; if the database is horizontal, it must first be converted into a vertical database.

This vertical approach makes the ECLAT algorithm faster than Apriori (although the intermediate TID lists can sometimes become too large for memory, which affects the algorithm's scalability). The Apriori algorithm has to scan the database again and again to find frequent itemsets; this limitation is removed by using the vertical data set in Eclat, which needs to scan the database only once.

While the Apriori algorithm works in a horizontal sense, imitating the breadth-first search of a graph, the ECLAT algorithm works in a vertical manner, just like the depth-first search of a graph, which is usually faster than breadth-first search.
Eclat uses a purely vertical transaction representation. No subset tests and no subset generation are needed to compute the support. Transaction ID sets, also called tidsets, are used to calculate the support: the support of an itemset is determined by intersecting the transaction ID lists of its items.
Let's see an example of how Eclat works on a vertical database:

Note: Using Eclat we only count support, because we only have itemsets and their supports. Since we are not creating the rules, we do not need to calculate the confidence.

Step 1 — List the Transaction ID (TID) set of each product

The first step is to make a list that contains, for each product, the IDs of the transactions in which the product occurs. This list is represented in the following table.
The ECLAT Algorithm. The Transaction ID (TID) sets for each product.

These transaction ID lists are called Transaction ID sets, or TID sets.

Step 2 — Filter with minimum support

The next step is to decide on a value called the minimum support. The minimum support will serve to
filter out products that do not occur often enough to be considered.

In the current example, we will choose a value of 7 for the minimum support. As you can see in the
table of Step 1, there are two products that have a TID set that contains less than 7 transactions: Flour
and Butter. Therefore, we will filter them out, and we obtain the following table:

The ECLAT Algorithm. Filtering out products that do not reach minimum support.

Step 3 — Compute the Transaction ID set of each product pair. We now move on to pairs of products.
We will basically repeat the same thing as in Step 1, but now for product pairs. The interesting thing about the ECLAT algorithm is that this step is done using the intersection of the two original TID sets.
This makes it different from the Apriori algorithm. The ECLAT algorithm is faster because it is much simpler to intersect two sets of transaction IDs than to scan each individual transaction for the presence of a pair of products (as Apriori does). The image below shows how easy it is to find the transaction IDs that are shared by the product pair Wine and Cheese:

The ECLAT Algorithm. Finding the intersection of Transaction IDs is easier than scanning the whole database.

Doing this intersection for each product pair (ignoring the products that did not reach support individually) gives the following table:
The Transaction ID sets for all product pairs that are still in the race.

Step 4 — Filter out the pairs that do not reach minimum support. As before, we need to filter out the results that do not reach the minimum support of 7. This leaves us with only two remaining product pairs: Wine & Cheese and Beer & Potato Chips.

The ECLAT Algorithm. There are two product pairs that meet support.

Step 5 — Continue as long as you can make new item sets above support. From this point on, you repeat the steps as long as possible. For the current example, if we create product groups of three products, we find that there is no group of three that reaches the minimum support level. Therefore, the association rules are those obtained in the previous step.
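A minimal sketch of the TID-set intersection that ECLAT relies on is shown below; the products, transaction IDs, and the minimum support of 3 are made up for illustration, since the tables referenced above are not reproduced here.

from itertools import combinations

# Hypothetical vertical database: product -> set of transaction IDs (its TID set).
tidsets = {
    "wine":   {1, 2, 3, 5, 9},
    "cheese": {1, 2, 3, 5, 6},
    "beer":   {4, 6, 7, 8},
    "chips":  {4, 7, 8},
    "flour":  {8},
}
min_support = 3

# Step 2: keep the single products that reach minimum support.
frequent_1 = {p: tids for p, tids in tidsets.items() if len(tids) >= min_support}

# Steps 3 and 4: the support of a pair is the size of the intersection of its TID sets.
frequent_2 = {
    (a, b): frequent_1[a] & frequent_1[b]
    for a, b in combinations(sorted(frequent_1), 2)
    if len(frequent_1[a] & frequent_1[b]) >= min_support
}

for pair, tids in frequent_2.items():
    print(pair, "support =", len(tids))   # ('beer', 'chips') and ('cheese', 'wine') survive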

Frequent Pattern (FP) Growth Algorithm

The FP-Growth algorithm is an alternative way to find frequent itemsets without using candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the usage of a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.
This algorithm works as follows:

First, it compresses the input database, creating an FP-tree instance to represent the frequent items.
After this first step, it divides the compressed database into a set of conditional databases, each associated with one frequent pattern.
Finally, each such database is mined separately.
Using this strategy, FP-Growth reduces the search cost by recursively looking for short patterns and then concatenating them into longer frequent patterns.

In large databases, holding the whole FP-tree in main memory may be impossible. A strategy to cope with this problem is to partition the database into a set of smaller databases (called projected databases) and then construct an FP-tree from each of these smaller databases.

FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative information about
frequent patterns in a database. Each transaction is read and then mapped onto a path in the FP-tree.
This is done until all transactions have been read. Different transactions with common subsets allow
the tree to remain compact because their paths overlap.

A frequent Pattern Tree is made with the initial item sets of the database. The purpose of the FP tree is
to mine the most frequent pattern. Each node of the FP tree represents an item of the item set.

The root node represents null, while the lower nodes represent the item sets. The associations of the
nodes with the lower nodes, that is, the item sets with the other item sets, are maintained while forming
the tree.

Han defines the FP-tree as the tree structure given below:

1. One root labelled "null", with a set of item-prefix subtrees as children and a frequent-item-header table.
2. Each node in an item-prefix subtree consists of three fields:
• Item-name: registers which item is represented by the node;
• Count: the number of transactions represented by the portion of the path reaching the node;
• Node-link: a link to the next node in the FP-tree carrying the same item name, or null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
• Item-name: the same as in the node;
• Head of node-link: a pointer to the first node in the FP-tree carrying the item name.
Additionally, the frequent-item-header table can store the support count of each item. The best-case scenario occurs when all transactions contain the same itemset; in that case the FP-tree consists of only a single branch of nodes.
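A minimal Python sketch of the node structure and of inserting transactions into the tree is given below, following the field names in Han's definition; the ordered transactions are taken from the worked example that follows, while the header table and node-links are left out for brevity.

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item-name
        self.count = 1          # count of transactions passing through this node
        self.parent = parent
        self.children = {}      # item -> FPNode

def insert_transaction(root, ordered_items):
    node = root
    for item in ordered_items:
        if item in node.children:
            node.children[item].count += 1   # shared prefix: just increment the count
        else:
            node.children[item] = FPNode(item, node)
        node = node.children[item]

# Ordered frequent items of the five transactions in the example below.
ordered_transactions = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

root = FPNode(None, None)   # the root is labelled "null"
for t in ordered_transactions:
    insert_transaction(root, t)

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)   # e.g. f:4 -> c:3 -> a:3 -> m:2 -> p:2, plus the smaller branches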

Example

The given data is a hypothetical dataset of transactions with each letter representing an item. The
minimum support given is 3.

TID Items Bought


100 f, a, c, d, g, i, m, p
200 a, b, c, f, l, m, o
300 b, f, h, j, o
400 b, c, k, s, p
500 a, f, c, e, l, p, m, n

In the frequent pattern growth algorithm, first, we find the frequency of each item. The following table
gives the frequency of each item in the given data.

Item Frequency   Item Frequency

a 3   j 1
b 3   k 1
c 4   l 2
d 1   m 3
e 1   n 1
f 4   o 2
g 1   p 3
h 1   s 1
i 1
A Frequent Pattern set (L) is built, which will contain all the elements whose frequency is greater than
or equal to the minimum support.
These elements are stored in descending order of their respective frequencies.

As minimum support is 3.
After insertion of the relevant items, the set L looks like this: -
L = { (f:4), (c:4), (a:3), (b:3), (m:3), (p:3) }

Now, for each transaction, the respective Ordered-Item set is built.


Frequent Pattern set L = { (f:4), (c:4), (a:3), (b:3), (m:3), (p:3) }

TID Items Bought (Ordered) Frequent Items


100 f, a, c, d, g, i, m, p f, c, a, m, p
200 a, b, c, f, l, m, o f, c, a, b, m
300 b, f, h, j, o f, b
400 b, c, k, s, p c, b, p
500 a, f, c, e, l, p, m, n f, c, a, m, p

Now, all the Ordered-Item sets are inserted into a trie data structure (the frequent-pattern tree), starting from a root node labelled null.

Now, for each item, the Conditional Pattern Base is computed, which consists of the path labels of all the paths that lead to a node of the given item in the frequent-pattern tree.

Item Conditional Pattern Base


p {{f, c, a, m : 2}, {c, b : 1}}
m {{f, c, a : 2}, {f, c, a, b : 1}}
b {{f, c, a : 1}, {f : 1}, {c : 1}}
a {{f, c : 3}}
c {{f : 3}}
f Φ

Now, for each item, the Conditional Frequent Pattern Tree is built. It is obtained by taking the set of elements that is common to all the paths in the Conditional Pattern Base of that item, and its support count is calculated by summing the support counts of all the paths in the Conditional Pattern Base.

Item Conditional Pattern Base Conditional FP-Tree


p {{f, c, a, m : 2}, {c, b : 1}} {c : 3}
m {{f, c, a : 2}, {f, c, a, b : 1}} {f, c, a :3}
b {{f, c, a : 1}, {f : 1}, {c : 1}} Φ
a {{f, c : 3}} {f, c : 3}
c {{f : 3}} {f : 3}
f Φ Φ

From the Conditional Frequent Pattern tree, the Frequent Pattern rules are generated by pairing the
items of the Conditional Frequent Pattern Tree set to the corresponding item.

Item | Conditional Pattern Base | Conditional FP-Tree | Frequent Patterns Generated
p | {{f, c, a, m : 2}, {c, b : 1}} | {c : 3} | {<c, p : 3>}
m | {{f, c, a : 2}, {f, c, a, b : 1}} | {f, c, a : 3} | {<f, m : 3>, <c, m : 3>, <a, m : 3>, <f, c, m : 3>, <f, a, m : 3>, <c, a, m : 3>}
b | {{f, c, a : 1}, {f : 1}, {c : 1}} | Φ | {}
a | {{f, c : 3}} | {f, c : 3} | {<f, a : 3>, <c, a : 3>, <f, c, a : 3>}
c | {{f : 3}} | {f : 3} | {<f, c : 3>}
f | Φ | Φ | {}

For each row, two types of association rules can be inferred.

For example, for the first row, which contains the pattern <c, p : 3>, the rules c -> p and p -> c can be inferred.
To determine the valid rules, the confidence of both rules is calculated, and the rules with confidence greater than or equal to the minimum confidence value are retained.

Mining Various Kinds of Association Rules

Association rule learning is a machine learning technique used for discovering interesting relationships between variables in large databases. It is designed to detect strong rules in the database based on some interestingness measures. For any given multi-item transaction, association rules aim to obtain rules that determine how or why certain items are linked. Association rules are created for finding information about general if-then patterns, using criteria such as support and confidence to identify the key relationships: support indicates how frequently an itemset appears in the data, while confidence indicates how often the if-then statement has been found to be true.
Types of Association Rules:
There are various types of association rules in data mining: -
• Multi-relational association rules
• Generalized association rules
• Quantitative association rules
• Interval information association rules

1. Multi-relational association rules: Multi-Relation Association Rules (MRAR) are a class of association rules which differ from the original, simple (and even the usual multi-relational) association rules that are typically extracted from multi-relational databases: in an MRAR, each rule element consists of one entity but several relationships. These relationships represent indirect relationships between the entities.

2. Generalized association rules: Generalized association rule extraction is a powerful tool for getting
a rough idea of interesting patterns hidden in data. However, since patterns are extracted at each level
of abstraction, the mined rule sets may be too large to be used effectively for decision-making.
Therefore, in order to discover valuable and interesting knowledge, post-processing steps are often
required. Generalized association rules should have categorical (nominal or discrete) properties on both
the left and right sides of the rule.

3. Quantitative association rules: Quantitative association rules are a special type of association rule. Unlike general association rules, where both the left and right sides of the rule should be categorical (nominal or discrete) attributes, at least one attribute (left or right) of a quantitative association rule must involve a numeric attribute.

4. Interval information association rules: The data is first pre-processed by smoothing and mapping. Next, interval association rules are generated; this involves partitioning the data via clustering before the rules are generated using an Apriori-style algorithm. Finally, these rules are used to identify data values that fall outside the expected intervals.

From Association Mining to Correlation Analysis

Most association rule mining algorithms employ a support-confidence framework. Often, many
interesting rules can be found using low support thresholds. Although minimum support and confidence
thresholds help weed out or exclude the exploration of a good number of uninteresting rules, many rules
so generated are still not interesting to the users. Unfortunately, this is especially true when mining at
low support thresholds or mining for long patterns. This has been one of the major bottlenecks for
successful application of association rule mining.

1) Strong Rules Are Not Necessarily Interesting: An Example

Whether or not a rule is interesting can be assessed either subjectively or objectively. Ultimately, only
the user can judge if a given rule is interesting, and this judgment, being subjective, may differ from
one user to another. However, objective interestingness measures, based on the statistics "behind" the data, can be used as one step toward the goal of weeding out uninteresting rules from presentation to
the user. The support and confidence measures are insufficient at filtering out uninteresting association
rules. To tackle this weakness, a correlation measure can be used to augment the support-confidence
framework for association rules. This leads to correlation rules of the form

A ⇒ B(support, confidence, correlation)

That is, a correlation rule is measured not only by its support and confidence but also by the correlation between item sets A and B. There are many different correlation measures from which to choose. In this section, we study various correlation measures to determine which would be good for mining large data sets.

Constraint-Based Association Mining

A data mining process may uncover thousands of rules from a given set of data, most of which end up being unrelated or uninteresting to the users. Often, users have a good sense of which "direction" of mining may lead to interesting patterns and of the "form" of the patterns or rules they would like to find. Thus, a good heuristic is to have the users specify such intuition or expectations as constraints to confine the search space. This strategy is known as constraint-based mining. The constraints can include the following:

• Knowledge type constraints: These specify the type of knowledge to be mined, such as association or correlation.
• Data constraints: These specify the set of task-relevant data.
• Dimension/level constraints: These specify the desired dimensions or attributes of the data, or the levels of the concept hierarchies to be used in mining.
• Interestingness constraints: These specify thresholds on statistical measures of rule interestingness, such as support, confidence, and correlation.
• Rule constraints: These specify the form of the rules to be mined. Such constraints may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes, attribute values, and/or aggregates.
