
LECTURE NOTES ON

DATA MINING
SYLLABUS:
Unit– I

Introduction To Data Mining: Introduction, What Is Data Mining, Definition, KDD,


Challenges, Data Mining Tasks, Data Preprocessing, Data Cleaning, Missing Data,
Dimensionality Reduction, Feature Subset Selection, Discretization and Binarization, Data
Transformation, Measures Of Similarity And Dissimilarity-Basics.

Unit-II

Association Rules: Problem Definition, Frequent Itemset Generation, Association Rule Mining,
The Apriori Principle, Support and Confidence Measures, Association Rule Generation: Apriori
Algorithm, The Partition Algorithms, FP-Growth Algorithms, Compact Representation of
Frequent Item Sets - Maximal Frequent Item Set, Closed Frequent Item Set.

Unit – III

Classification: Problem Definition, General Approaches To Solving A Classification Problem,


Evaluation Of Classifiers, Classification Techniques, Decision Trees-Decision Tree
Construction, Methods For Expressing Attribute Test Conditions, Measures For Best Split,
Algorithm For Decision Tree Induction, Naïve-Bayes Classifier, Bayesian Belief Networks,
K-Nearest Neighbor Classification-Algorithms and Characteristics

Unit – IV

Clustering: Problem Definition, Clustering Overview, Evaluation of Clustering Algorithms,


Partition Clustering-K-Means Algorithm, K-Means Additional Issues, PAM Algorithm

Hierarchical Clustering-Agglomerative and Divisive Methods, Basic Agglomerative


Hierarchical Clustering Algorithm, Specific Techniques, Key Issues In Hierarchical Clustering,
Strengths And Weaknesses, Outlier Detection

Unit – V

Web And Text Mining: Introduction, Web Mining, Web Content Mining, Web Structure Mining,
Web Usage Mining, Text Mining- Unstructured Text, Episode Rule Discovery For Texts,
Hierarchy Of Categories, Text Clustering
Chapter-1

1.1 What Is Data Mining?

Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer; a more appropriate name would have been knowledge mining from data,
which emphasizes that knowledge is being extracted from large amounts of data.

It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.

The key properties of data mining are


Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information

Focus on large datasets and databases

1.2 The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require either
sifting through an immense amount of material or intelligently probing it to find exactly where
the value resides. Given databases of sufficient size and quality, data mining technology can
generate new business opportunities by providing these capabilities:
Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-
on analysis can now be answered directly from the data — quickly. A typical example of a
predictive problem is targeted marketing. Data mining uses data on past promotional mailings to
identify the targets most likely to maximize return on investment in future mailings. Other
predictive problems include forecasting bankruptcy and other forms of default and identifying
segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is
the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.

1.3 Tasks of Data Mining


Data mining involves six common classes of tasks:

Anomaly detection (Outlier/change/deviation detection) – The identification of


unusual data records, that might be interesting or data errors that require further
investigation.

Association rule learning (Dependency modeling) – Searches for relationships


between variables. For example, a supermarket might gather data on customer purchasing
habits. Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.

Clustering – is the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.

Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
Regression – attempts to find a function that models the data with the least error (see the
short least-squares sketch after this list).
Summarization – providing a more compact representation of the data set,
including visualization and report generation.
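As a small, self-contained illustration of the regression task just described, the following Python sketch (added here for clarity; the data points and variable names are invented, not taken from these notes) fits a least-squares line to a handful of points and reports how well the function models the data.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])             # roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)           # least-squares fit of a line
predictions = slope * x + intercept
mse = np.mean((y - predictions) ** 2)                # error of the fitted function

print(f"model: y = {slope:.2f}x + {intercept:.2f}, mean squared error = {mse:.3f}")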

1.4 Architecture of Data Mining

A typical data mining system may have the following major components.

1. Knowledge Base:

This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept
hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included. Other examples of
domain knowledge are additional interestingness constraints or thresholds, and
metadata (e.g., describing data from multiple heterogeneous sources).

2. Data Mining Engine:

This is essential to the data mining system and ideally consists of functional modules
for tasks such as characterization, association and correlation analysis, classification,
prediction, cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module:

This component typically employs interestingness measures to interact with the data
mining modules to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly
recommended to push the evaluation of pattern interestingness as deep as possible
into the mining process to confine the search to only the interesting patterns.

4. User interface:

This module communicates between users and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based
on the intermediate data mining results. In addition, this component allows the user to
browse database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.
1.5 Data Mining Process:

Data Mining is a process of discovering various models, summaries, and derived values from a
given collection of data.
The general experimental procedure adapted to data-mining problems involves the following
steps:
1. State the problem and formulate the hypothesis

Most data-based modeling studies are performed in a particular application domain.


Hence, domain-specific knowledge and experience are usually necessary in order to come
up with a meaningful problem statement. Unfortunately, many application studies tend to
focus on the data-mining technique at the expense of a clear problem statement. In this
step, a modeler usually specifies a set of variables for the unknown dependency and, if
possible, a general form of this dependency as an initial hypothesis. There may be several
hypotheses formulated for a single problem at this stage. The first step requires the
combined expertise of an application-domain expert and a data-mining expert. In practice, it
usually means a close interaction between the data-mining expert and the application
expert. In successful data-mining applications, this cooperation does not stop in the initial
phase; it continues during the entire data-mining process.

2. Collect the data

This step is concerned with how the data are generated and collected. In general, there are
two distinct possibilities. The first is when the data-generation process is under the
control of an expert (modeler): this approach is known as a designed experiment. The
second possibility is when the expert cannot influence the data-generation process: this is
known as the observational approach. An observational setting, namely, random data
generation, is assumed in most data-mining applications. Typically, the sampling
distribution is completely unknown after data are collected, or it is partially and implicitly
given in the data-collection procedure. It is very important, however, to understand how
data collection affects its theoretical distribution since such a priori knowledge can be
very useful for modeling and, later, for the final interpretation of results. Also, it is
important to make sure that the data used for estimating a model and the data used later
for testing and applying a model come from the same, unknown, sampling distribution. If
this is not the case, the estimated model cannot be successfully used in a final application
of the results.

3. Preprocessing the data

In the observational setting, data are usually "collected" from existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common
tasks:

1. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors, coding and recording errors, and, sometimes, are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers:

a. Detect and eventually remove outliers as a part of the preprocessing phase, or


b. Develop robust modeling methods that are insensitive to outliers.

2. Scaling, encoding, and selecting features – Data preprocessing includes several steps
such as variable scaling and different types of encoding. For example, one feature with
the range [0, 1] and the other with the range [−100, 1000] will not have the same weights
in the applied technique; they will also influence the final data-mining results differently.
Therefore, it is recommended to scale them and bring both features to the same weight
for further analysis. Also, application-specific encoding methods usually achieve
dimensionality reduction by providing a smaller number of informative features for
subsequent data modeling.
These two classes of preprocessing tasks are only illustrative examples of a large
spectrum of preprocessing activities in a data-mining process.
Data-preprocessing steps should not be considered completely independent from other
data-mining phases. In every iteration of the data-mining process, all activities, together,
could define new and improved data sets for subsequent iterations. Generally, a good
preprocessing method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and
encoding.

4. Estimate the model

The selection and implementation of the appropriate data-mining technique is the main
task in this phase. This process is not straightforward; usually, in practice, the
implementation is based on several models, and selecting the best one is an additional
task. The basic principles of learning and discovery from data are given in Chapter 4 of
this book. Later, Chapters 5 through 13 explain and analyze specific techniques that are
applied to perform a successful learning process from data and to develop an appropriate
model.

5. Interpret the model and draw conclusions

In most cases, data-mining models should help in decision-making. Hence, such models
must be interpretable to be useful because humans are not likely to base their decisions
on complex "black-box" models. Note that the goals of the model's accuracy and its
interpretation's accuracy are somewhat contradictory. Usually, simple models are more
interpretable, but they are also less accurate. Modern data mining methods are expected
to yield highly accurate results using high-dimensional models. The problem of
interpreting these models, also very important, is considered a separate task, with specific
techniques to validate the results. A user does not want hundreds of pages of numeric
results. He does not understand them; he cannot summarize, interpret, and use them for
successful decision-making.

The Data Mining Process

1.6 Classification of Data Mining Systems:

The data mining system can be classified according to the following criteria:

Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Some Other Classification Criteria:

Classification according to the kind of databases mined
Classification according to the kind of knowledge mined
Classification according to the kinds of techniques utilized
Classification according to the applications adapted

Classification according to the kind of databases mined

We can classify the data mining system according to the kind of databases mined. Database
systems can be classified according to different criteria such as data models, types of data, etc.
And the data mining system can be classified accordingly. For example, if we classify the
database according to the data model then we may have a relational, transactional, object-
relational, or data warehouse mining system.

Classification according to the kind of knowledge mined

We can classify the data mining system according to the kind of knowledge mined. It means
data mining systems are classified based on functionalities such as:

Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Classification according to the kinds of techniques utilized

We can classify the data mining system according to the kind of techniques used. We can
describe these techniques according to the degree of user interaction involved or the methods
of analysis employed.

Classification according to applications adapted

We can classify the data mining system according to the application adapted. These applications
are as follows:

Finance
Telecommunications
DNA
Stock Markets
E-mail

1.7 Major Issues in Data Mining:

Mining different kinds of knowledge in databases. - The needs of different users are not the
same. Different users may be interested in different kinds of knowledge. Therefore, it is
necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on returned results.

Incorporation of background knowledge. - To guide the discovery process and to express the
discovered patterns, the background knowledge can be used. Background knowledge may be
used to express the discovered patterns not only in concise terms but at multiple levels of
abstraction.
Data mining query languages and ad hoc data mining. - A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results. - Once the patterns are discovered
they need to be expressed in high-level languages, and visual representations. This representation
should be easily understandable by the users.

Handling noisy or incomplete data. - Data cleaning methods are required that can handle
noisy and incomplete data while mining the data regularities. Without such methods, the
accuracy of the discovered patterns will be poor.

Pattern evaluation. - This refers to the interestingness of the discovered patterns. Patterns may
be uninteresting because they represent common knowledge or lack novelty, so measures are
needed to assess how interesting the discovered patterns really are.

Efficiency and scalability of data mining algorithms. - To effectively extract the information
from huge amounts of data in databases, data mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of
databases, wide distribution of data, and complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the
data into partitions which are further processed in parallel. Then the results from the partitions
are merged. Incremental algorithms incorporate database updates without having to mine the
entire data again from scratch.
Knowledge Discovery in Databases (KDD)
Some people treat data mining the same as knowledge discovery, while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps
involved in the knowledge discovery process:

Data Cleaning - In this step the noise and inconsistent data are removed.
Data Integration - In this step, multiple data sources are combined.
Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation - In this step data are transformed or consolidated into
forms appropriate for mining by performing summary or aggregation operations.
Data Mining - In this step, intelligent methods are applied to extract data patterns.
Pattern Evaluation - In this step, data patterns are evaluated.
Knowledge Presentation - In this step, knowledge is represented.
The following diagram shows the knowledge discovery process:

Architecture of KDD

Alongside the mining steps, a KDD/data warehousing environment maintains metadata that
describes the data and how it is processed. Such metadata typically includes:

The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.

The mapping from the operational environment to the data warehouse, which includes source
databases and their contents, gateway descriptions, data partitions, data extraction, cleaning,
and transformation rules and defaults, data refresh and purging rules, and security
(user authorization and access control).

Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of
refresh, update, and replication cycles.

Business metadata, which includes business terms and definitions, data ownership
information, and charging policies.

1.9. Data Preprocessing:

Data Integration:

It combines data from multiple sources into a coherent data store, as in data warehousing. These
sources may include multiple databases, data cubes, or flat files.

A data integration system is formally defined as a triple <G, S, M>, where

G is the global schema,

S is the heterogeneous set of source schemas, and

M is the mapping between queries over the sources and the global schema.


Issues in Data Integration:

1. Schema integration and object matching:

How can the data analyst or the computer be sure that the customer id in one database
and the customer number in another refer to the same attribute?

2. Redundancy:

An attribute (such as annual revenue, for instance) may be redundant if it can be derived
from another attribute or set of attributes. Inconsistencies in attribute or dimension
naming can also cause redundancies in the resulting data set.

3. Detection and resolution of data conflicts:

For the same real-world entity, attribute values from different sources may differ.

Data Reduction:

Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data. That is, mining
on the reduced data set should be more efficient yet produce the same (or almost the same)
analytical results.
Strategies for data reduction include the following:

Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.

Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or


dimensions may be detected and removed.

Dimensionality reduction, where encoding mechanisms are used to reduce the dataset
size.
Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need to store only the model
parameters instead of the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity
reduction that is very useful for the automatic generation of concept hierarchies.
Discretization and concept hierarchy generation are powerful tools for data mining, in that
they allow the mining of data at multiple levels of abstraction.
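To make the numerosity-reduction strategy above concrete, the following Python sketch (an added illustration with synthetic data, not part of the original notes) replaces a set of 10,000 raw values with a small random sample and, alternatively, with a 20-bin histogram.

import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=100, scale=15, size=10_000)     # the "full" data set

# Sampling: keep a small random subset instead of all 10,000 values.
sample = rng.choice(values, size=200, replace=False)

# Histogram: store only bin counts and edges instead of the raw values.
counts, edges = np.histogram(values, bins=20)

print("original size:", values.size)
print("sample size  :", sample.size)
print("histogram    :", counts.size, "counts +", edges.size, "bin edges")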

Dimensionality Reduction:
Dimensionality reduction is a data preprocessing technique used in data mining to reduce the number
of input variables (features or dimensions) in a dataset while preserving as much relevant information
as possible. It is particularly useful when dealing with high-dimensional datasets, where many
features may be redundant, irrelevant, or noisy.
Why Dimensionality Reduction is Important
1. Improves Model Performance:
o Reduces overfitting by eliminating irrelevant features.
o Enhances algorithm performance by decreasing computation time.
2. Simplifies Data Visualization:
o Helps in visualizing high-dimensional data by projecting it onto 2D or 3D spaces.
3. Reduces Storage Requirements:
o Lessens memory and storage demands by removing redundant data.
4. Eases Interpretability:
o Simplifies complex datasets, making the insights more understandable.
Techniques for Dimensionality Reduction
1. Feature Selection
 Selects a subset of relevant features while discarding irrelevant ones.
 Common methods:
o Filter Methods: Use statistical techniques (e.g., correlation, chi-square) to rank
features.
o Wrapper Methods: Use model-based evaluation (e.g., forward or backward
selection).
o Embedded Methods: Combine feature selection with model training (e.g., Lasso
regression).
2. Feature Extraction
 Creates new features by combining or transforming existing ones.
 Common techniques:
o Principal Component Analysis (PCA):
 Projects data onto a lower-dimensional space by finding directions (principal
components) that maximize variance.
 Retains most of the dataset’s variability.
o Linear Discriminant Analysis (LDA):
 Maximizes separability among classes by finding linear combinations of
features.
o t-SNE (t-Distributed Stochastic Neighbor Embedding):
 Non-linear dimensionality reduction method focused on preserving local
relationships in data.
o Autoencoders:
 Neural network-based technique for non-linear feature extraction, often used
in deep learning.
o Independent Component Analysis (ICA):
 Focuses on maximizing statistical independence among extracted components.
3. Clustering-Based Reduction
 Groups similar features or data points and replaces them with representative features.
 Techniques: K-Means, Hierarchical Clustering.
4. Manifold Learning
 Maps high-dimensional data to lower dimensions while preserving its structure.
 Techniques: Isomap, Locally Linear Embedding (LLE).
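To make the PCA technique listed above concrete, here is a minimal sketch using only NumPy; the helper name pca, its n_components parameter, and the synthetic data are illustrative assumptions, not part of these notes.

import numpy as np

def pca(X, n_components=2):
    # Center each feature, then take the top maximum-variance directions via SVD.
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                       # principal directions
    explained_variance = (S ** 2) / (len(X) - 1)
    return X_centered @ components.T, explained_variance[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                            # 100 samples, 5 features
X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # make one feature redundant
X_reduced, variance = pca(X, n_components=2)
print(X_reduced.shape, variance)                         # (100, 2) plus top-2 variances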
Applications
1. Data Visualization:
o Simplify complex datasets for 2D/3D visual exploration.
o Example: Visualizing customer segments in marketing.
2. Preprocessing for Machine Learning:
o Reduce computation time and enhance model accuracy.
o Example: Simplifying datasets for training machine learning models.
3. Noise Reduction:
o Eliminate noisy or irrelevant data for better analysis.
o Example: Removing redundant features in sensor data.
4. Genomics:
o Analyze large-scale gene expression data with fewer variables.
o Example: Identifying significant genes related to diseases.
Advantages of Dimensionality Reduction
1. Reduces Computational Costs:
o Less memory and processing power required.
2. Improves Model Generalization:
o Reduces overfitting by eliminating irrelevant dimensions.
3. Simplifies Data Analysis:
o Helps focus on the most important features.
Limitations
1. Loss of Information:
o Reduction might discard critical data, impacting accuracy.
2. Interpretability:
o Transformed features (e.g., PCA components) may lack interpretability.
3. Computational Complexity:
o Some methods (e.g., t-SNE) can be computationally expensive.
4. Parameter Sensitivity:
o Results can vary significantly based on chosen parameters (e.g., number of
components in PCA).

Dimensionality reduction is an essential technique in data mining, enabling effective analysis and
model building, especially with high-dimensional datasets. Choosing the right method depends on
the dataset characteristics and the goals of the analysis.
Discretization and Binarization:
Discretization and binarization are preprocessing techniques in data mining that transform
continuous or categorical data into forms that are easier to analyse, especially for certain algorithms
like decision trees, rule-based systems, or clustering.
1. Discretization
Definition:
Discretization converts continuous data (e.g., numerical values like age and temperature) into
discrete intervals or categories (e.g., "young," "middle-aged," "old").
Types of Discretization
1. Unsupervised Discretization:
o Does not use class labels and divides the data based purely on numerical distribution.
o Techniques:
 Equal Width Binning:
 Divides the range of values into intervals of equal size.
 Example:
 Age: [18–30), [30–50), [50–70).
 Equal Frequency Binning:
 Divides the range such that each bin contains approximately the same
number of data points.
 Example:
 For 100 records: [18–25), [25–35), [35–50).
2. Supervised Discretization:
o Considers class labels when creating bins to maximize information gain or minimize
entropy.
o Techniques:
 Entropy-Based Discretization:
 Uses class information to determine the optimal intervals.
 Example: Dividing a continuous variable where intervals maximize the
separation of class labels.
 ChiMerge:
 Merges adjacent intervals based on the chi-squared test to group values
with similar distributions.
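The equal-width and equal-frequency binning described above can be sketched in a few lines of Python; the code below is an added illustration, and the list of ages and the choice of three bins are invented for the example.

import numpy as np

ages = np.array([18, 22, 25, 29, 33, 35, 41, 48, 52, 60, 67, 70])
n_bins = 3

# Equal-width binning: split the range [min, max] into bins of equal size.
width_edges = np.linspace(ages.min(), ages.max(), n_bins + 1)
width_bins = np.digitize(ages, width_edges[1:-1])        # bin index 0..n_bins-1

# Equal-frequency binning: edges chosen so each bin holds roughly the same count.
freq_edges = np.quantile(ages, np.linspace(0, 1, n_bins + 1))
freq_bins = np.digitize(ages, freq_edges[1:-1])

print("equal-width edges    :", width_edges)
print("equal-width bins     :", width_bins)
print("equal-frequency edges:", freq_edges)
print("equal-frequency bins :", freq_bins)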
Advantages of Discretization
 Reduces Complexity:
o Simplifies data representation by converting continuous variables into a smaller set of
discrete categories.
 Improves Interpretability:
o Categories (e.g., "low," "medium," "high") are easier to interpret than raw numerical
values.
 Compatibility with Algorithms:
o Some algorithms work better with discrete data (e.g., Naive Bayes, decision trees).
Limitations of Discretization
 Loss of Information:
o Aggregating data into intervals can discard fine-grained details.
 Parameter Sensitivity:
o Results depend on the choice of intervals or binning methods.
 Boundary Issues:
o Data near bin edges may be misclassified or lose meaningful comparisons.

2. Binarization
Definition:
Binarization converts categorical or numerical data into binary (0 or 1) values, representing the
presence or absence of a feature.

Types of Binarization:
1. For Numerical Data:
o Threshold-Based Binarization:
 A threshold value is chosen, and values above the threshold are set to 1, while
values below are set to 0.
 Example:
 Income > $50,000 → 1 (High); Income ≤ $50,000 → 0 (Low).
2. For Categorical Data:
o One-Hot Encoding:
 Converts a categorical variable with k categories into k binary variables.
 Example:
 Colour: {Red, Green, Blue} → (1, 0, 0), (0, 1, 0), (0, 0, 1).
o Binary Encoding:
 Assigns each category a unique binary code.
 Example:
 Categories: {A, B, C, D} → {00, 01, 10, 11}.
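The following short Python sketch, added for illustration, shows threshold-based binarization and one-hot encoding; the income values reuse the $50,000 threshold from the example above, while the colour list and variable names are invented.

import numpy as np

incomes = np.array([32000, 54000, 75000, 41000, 98000])
high_income = (incomes > 50000).astype(int)              # 1 = high, 0 = low
print(high_income)                                       # [0 1 1 0 1]

colours = ["Red", "Green", "Blue", "Green"]
categories = sorted(set(colours))                        # fixed category order
one_hot = np.array([[1 if c == cat else 0 for cat in categories] for c in colours])
print(categories)                                        # ['Blue', 'Green', 'Red']
print(one_hot)                                           # one row per value, one column per category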
Advantages of Binarization
 Simplifies Data Representation:
o Binary values are straightforward and work well with many machine-learning
algorithms.
 Compatibility with Algorithms:
o Algorithms like SVMs and neural networks often prefer binary inputs.
 Removes Ordinality Issues:
o One-hot encoding prevents algorithms from assuming an ordinal relationship among
categories.
Limitations of Binarization
 Increased Dimensionality:
o One-hot encoding for categorical variables with many unique values (e.g., city names)
can lead to very high-dimensional data.
 Loss of Information:
o Binary representations may oversimplify relationships (e.g., threshold-based
binarization discards magnitude information).
 Memory and Computation Cost:
o More binary features increase the computational load.
Applications in Data Mining
1. Discretization:
o Often used in algorithms requiring discrete inputs, like Naive Bayes or rule-based
classifiers.
o Useful for visualizing patterns in numerical data.
2. Binarization:
o Essential for methods that rely on binary input, such as association rule mining and
some clustering algorithms.
o Common in text mining (e.g., presence/absence of keywords).
Comparison

Aspect               Discretization                                 Binarization
Input Data           Continuous                                     Continuous or Categorical
Output Data          Discrete categories                            Binary (0 or 1)
Use Case             Reducing complexity of continuous variables    Representing data in binary form for models
Impact on Features   May reduce dimensionality                      Often increases dimensionality (e.g., one-hot encoding)

Discretization and binarization are critical for preparing data for specific algorithms, enabling better
performance and interpretability. However, their use depends on the dataset characteristics and the
goals of the analysis.

Data Transformation:

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Data transformation can involve the following:

Smoothing, which works to remove noise from the data. Such techniques include
binning, regression, and clustering.

Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated to compute monthly and annual total
amounts. This step is typically used in constructing a data cube to analyze the data at
multiple granularities.
Generalization of the data, where low-level or "primitive" (raw) data are replaced by
higher-level concepts through concept hierarchies. For example, categorical attributes,
like street, can be generalized to higher-level concepts, like city or country.
Normalization, where the attribute data are scaled to fall within a small specified range,
such as -1.0 to 1.0, or 0.0 to 1.0.

Attribute construction (or feature construction), where new attributes are constructed
and added from the given set of attributes to help the mining process.

Data Transformation:

Data transformation in data mining refers to converting data into an appropriate format or structure
to make it more suitable for analysis or to enhance the performance of mining algorithms. It plays a
critical role in preparing raw data for mining tasks such as classification, clustering, and association
rule mining.

Why is Data Transformation Important?


 Improves Data Quality: Ensures consistency, accuracy, and compatibility across datasets.
 Simplifies Analysis: Reduces complexity by converting data into forms that algorithms can
process effectively.
 Enhances Model Performance: Helps mining algorithms operate efficiently, especially
when they have specific data requirements.
 Supports Better Insights: Ensures that transformed data reflects meaningful relationships
and patterns.
Key Techniques in Data Transformation
1. Normalization
 Definition: Scaling numerical data to fall within a specific range (e.g., [0, 1] or [-1, 1]).
 Purpose: Ensures all features contribute equally to the analysis, avoiding dominance by
features with larger ranges.
 Techniques:
o Min-Max Normalization: x_norm = (x - min(x)) / (max(x) - min(x))
o Z-Score Normalization: z = (x - μ) / σ, where μ is the mean and σ is the standard
deviation.
o Decimal Scaling:
 Scales data by moving the decimal point.
 Example: Divide by 10, 100, etc., based on the maximum absolute value.
(A short code sketch of normalization and smoothing appears after this list of techniques.)
2. Aggregation
 Definition: Summarizing or combining data.
 Purpose: Reduces data size and highlights higher-level patterns.
 Example:
o Aggregating daily sales data into monthly totals.
3. Discretization
 Definition: Converting continuous data into discrete intervals or categories.
 Purpose: Simplifies analysis and enables compatibility with algorithms requiring discrete
inputs.
 Techniques:
o Equal Width Binning
o Equal Frequency Binning
o Entropy-based binning (supervised discretization)
4. Binarization
 Definition: Converting data into binary (0 or 1) format.
 Purpose: Encodes categorical or numerical data for binary-based algorithms.
 Example:
o Threshold-based binarization for numerical data: Income > $50,000 → 1; otherwise
→ 0.
o One-hot encoding for categorical data.
5. Feature Construction
 Definition: Creating new features from existing ones by combining or transforming them.
 Purpose: Enhances the quality and relevance of features for mining tasks.
 Example:
o Creating a “BMI” feature from weight and height.
o Combining “date” and “time” into a “timestamp.”
6. Smoothing
 Definition: Reducing noise in data to improve pattern detection.
 Purpose: Prepares data for more accurate analysis.
 Techniques:
o Bin Smoothing: Grouping values into bins and replacing them with bin means or
medians.
o Moving Average: Replacing values with the average of their neighbors.
7. Data Integration
 Definition: Combining data from multiple sources into a cohesive dataset.
 Purpose: Prepares data for unified analysis, eliminating redundancy and inconsistency.
 Challenges:
o Resolving schema conflicts.
o Handling missing values across datasets.
8. Principal Component Analysis (PCA)
 Definition: A dimensionality reduction technique that transforms data into a new set of
variables (principal components).
 Purpose: Reduces the number of features while preserving variability.
 Use Case: High-dimensional datasets where some features are correlated.
9. Data Scaling
 Definition: Adjusting the scale of data to fit the requirements of specific algorithms.
 Purpose: Prevents algorithms from being biased toward features with larger magnitudes.
 Examples:
o Scaling pixel values in images to [0, 1].
o Standardizing financial data across currencies.
10. Data Encoding
 Definition: Transforming categorical data into numerical forms.
 Purpose: Makes categorical data compatible with machine learning algorithms.
 Techniques:
o One-hot encoding.
o Label encoding.
o Frequency encoding.
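As noted under Normalization above, here is a small Python sketch (added for illustration, with invented numbers) of min-max normalization, z-score normalization, and a simple moving-average smoother, corresponding to techniques 1 and 6 in the list.

import numpy as np

x = np.array([12.0, 15.0, 14.0, 80.0, 16.0, 13.0, 15.0])   # 80.0 acts as a noisy value

# Min-max normalization into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization (mean 0, standard deviation 1)
x_zscore = (x - x.mean()) / x.std()

# Moving-average smoothing with a window of 3 neighbouring values
window = 3
x_smooth = np.convolve(x, np.ones(window) / window, mode="valid")

print("min-max :", np.round(x_minmax, 2))
print("z-score :", np.round(x_zscore, 2))
print("smoothed:", np.round(x_smooth, 2))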
Applications of Data Transformation
1. Classification:
o Normalize or discretize data to improve algorithm performance.
2. Clustering:
o Use PCA to reduce dimensions for clustering in lower-dimensional space.
3. Association Rule Mining:
o Binarize data to identify frequent itemsets.
4. Visualization:
o Aggregate and normalize data for clear visual insights.
5. Anomaly Detection:
o Transform data to detect deviations from normal patterns.
Advantages of Data Transformation
1. Improved Model Accuracy: Better-pre-processed data leads to more accurate and robust
models.
2. Reduced Computational Complexity: Aggregation and normalization streamline data
processing.
3. Compatibility with Algorithms: Prepares data to meet the input requirements of specific
mining techniques.
4. Enhanced Interpretability: Smoothing and feature construction make patterns more
understandable.
Limitations of Data Transformation
1. Information Loss: Discretization or dimensionality reduction may discard valuable details.
2. Bias Introduction: Improper scaling or encoding might skew results.
3. Increased Preprocessing Time: Some transformations (e.g., PCA) can be computationally
expensive.
4. Dependency on Domain Knowledge: Feature construction requires an understanding of the
data context.

Similarity and Dissimilarity Measure


In data mining and data science, the ability to measure how alike or different two data points are
plays a critical role in several applications, including clustering, classification, and information
retrieval. Similarity and dissimilarity measures provide the mathematical foundation for these
tasks, allowing algorithms to interpret and analyse complex datasets effectively. This section
describes the most common similarity and dissimilarity measures, highlighting their significance
and applications.

Similarity Measures:
Similarity measures quantify how alike two data points are. They are pivotal in applications such
as clustering, classification, and information retrieval. The most commonly used similarity
measures, their formulas, descriptions, and typical applications are given below.
1. Euclidean Distance
Formula: d(x, y) = sqrt( Σ_i (x_i − y_i)² )
Description: Euclidean distance is the straight-line distance between two points in a
multi-dimensional space. It is intuitive and widely used in many applications, especially when the
features are continuous and measured on a comparable scale across dimensions.
Applications: Commonly used in clustering algorithms such as k-means and in nearest-neighbour
searches.
2. Cosine Similarity
Formula: cos(x, y) = (x · y) / (||x|| ||y||)
Description: Cosine similarity measures the cosine of the angle between two vectors. It is
particularly useful in high-dimensional spaces, such as text mining, where it measures orientation
rather than magnitude, making it scale-invariant.
Applications: Widely used in text mining and information retrieval, for example document
similarity in search engines.
3. Jaccard Similarity
Formula: J(A, B) = |A ∩ B| / |A ∪ B|
Description: Jaccard similarity measures the similarity between two finite sets by dividing the
size of their intersection by the size of their union. It is well suited to comparing categorical or
binary data.
Applications: Commonly used in clustering and classification tasks involving categorical data,
such as market basket analysis.
4. Pearson Correlation Coefficient
Formula: r = Σ_i (x_i − x̄)(y_i − ȳ) / ( sqrt(Σ_i (x_i − x̄)²) · sqrt(Σ_i (y_i − ȳ)²) )
Description: Pearson correlation measures the linear correlation between two variables, yielding
a value between −1 and 1. It assesses how well a change in one variable predicts a change in the
other.
Applications: Used in statistical analysis and machine learning to detect and quantify linear
relationships between features.
5. Hamming Distance
Formula: H(s, t) = number of positions i at which s_i ≠ t_i
Description: Hamming distance counts the number of positions at which the corresponding
elements of two strings (or vectors) differ. It is especially useful for binary or categorical data.
Applications: Used in error detection and correction algorithms, and for comparing binary
sequences or categorical variables.
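The measures described above can be computed directly; the following Python sketch (added for illustration, using made-up vectors, sets, and strings) evaluates Euclidean distance, cosine similarity, Jaccard similarity, and Hamming distance.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))                    # straight-line distance
cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based, scale-invariant

set_a = {"milk", "bread", "butter"}
set_b = {"milk", "bread", "beer"}
jaccard = len(set_a & set_b) / len(set_a | set_b)            # |intersection| / |union|

s1, s2 = "10110", "10011"
hamming = sum(c1 != c2 for c1, c2 in zip(s1, s2))            # number of differing positions

print(euclidean, cosine, jaccard, hamming)                   # 3.74..., 1.0, 0.5, 2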
Applications of Similarity Measures:
Similarity measures allow algorithms to group, classify, and retrieve records based on how alike
the data points are. This capability is essential in fields ranging from text mining to image
recognition. Some key applications are described below.
1. Clustering
Clustering groups a set of objects so that objects in the same group (or cluster) are more similar
to each other than to those in other groups. Similarity measures play an essential role in defining
these groups.
o K-Means Clustering: Uses Euclidean distance to partition data into k clusters. Each data
point is assigned to the cluster with the nearest centroid.
o Hierarchical Clustering: Uses various distance metrics (e.g., Euclidean, Manhattan) to
build a hierarchy of clusters, often visualized as a dendrogram.
o Text Clustering: Uses cosine similarity to group documents with similar content, which is
particularly useful for organizing large text corpora.
2. Classification
Classification assigns a label to a new data point based on the characteristics of known labelled
data points. Similarity measures help determine the label by comparing the new point to existing
points.
o K-Nearest Neighbors (k-NN): Classifies a data point based on the majority label among its
k nearest neighbours, often using Euclidean distance or cosine similarity.
o Document Classification: Uses similarity measures such as cosine similarity to assign text
documents to predefined classes.
3. Information Retrieval
Information retrieval systems, such as search engines, rely on similarity measures to rank
documents by their relevance to a query.
o Search Engines: Use cosine similarity to compare the query vector with document vectors,
ranking documents by their similarity to the query.
o Content-Based Filtering: In recommendation systems, similarity measures (e.g., cosine
similarity, Jaccard similarity) are used to recommend items that are similar to those a user
has previously liked.
4. Recommendation Systems
Recommendation systems suggest items to users based on their preferences and behaviour, often
using similarity measures to find items or users that are alike.
o Collaborative Filtering: Uses similarity measures like Pearson correlation or cosine
similarity to find users with similar preferences and recommend items they have liked.
o Content-Based Filtering: Recommends items similar to those the user has shown interest
in, using measures like cosine similarity to compare item features.
5. Anomaly Detection
Anomaly detection identifies outliers or unusual data points that differ substantially from the
bulk of the data.
o Mahalanobis Distance: Takes the correlations of the dataset into account to detect
multivariate outliers.
o Euclidean Distance: Can be used in simpler settings to find data points that are far from
the mean or median of the dataset.
6. Natural Language Processing (NLP)
In NLP, similarity measures are used to compare text data, supporting tasks such as document
clustering, plagiarism detection, and sentiment analysis.
o Word Embeddings: Use cosine similarity to compare word vectors in models such as
Word2Vec or GloVe, enabling the identification of semantically similar words.
o Document Similarity: Measures such as cosine similarity help cluster documents or detect
plagiarism by comparing text content.
7. Image Processing
Image processing involves analysing and manipulating images, where similarity measures are
used to compare image features.
o Image Retrieval: Uses measures like Euclidean distance on feature vectors (e.g., colour
histograms, edge descriptors) to find similar images.
o Face Recognition: Employs measures like cosine similarity on feature vectors extracted
from deep learning models to identify or verify people.
8. Bioinformatics
In bioinformatics, similarity measures help compare biological data such as genetic sequences or
protein structures.
o Sequence Alignment: Uses Hamming distance to compare DNA, RNA, or protein
sequences, identifying similarities and differences that may indicate evolutionary
relationships.
o Protein Structure Comparison: Employs measures such as RMSD (Root Mean Square
Deviation) to compare 3-D protein structures, aiding the study of their functions and
interactions.
Dissimilarity Measures
Dissimilarity measures, often called distance metrics, quantify how different two data points are.
They support tasks such as clustering, classification, and anomaly detection: by understanding
how different two data points are, algorithms can better organize, classify, and analyse the data.
The most commonly used dissimilarity measures, their descriptions, and typical applications are
given below.
1. Euclidean Distance

o Description: Euclidean distance is the straight-line distance between two points in a
multi-dimensional space. It is intuitive and widely used, particularly when the features of
the data are on a comparable scale.
o Applications: Frequently used in clustering algorithms such as k-means, and in
nearest-neighbour searches.
2. Manhattan Distance (L1 norm)

o Description: Also referred to as the taxicab or city-block distance, Manhattan distance
measures the distance between two points by summing the absolute differences of their
coordinates. It is useful for high-dimensional data and when the data dimensions are not
on the same scale.
o Applications: Used in clustering, particularly when dealing with high-dimensional spaces
or data with differing scales.
3. Hamming Distance
o Description: Hamming distance measures the number of positions at which the
corresponding elements of two strings differ. It is commonly used for categorical data or
binary strings.
o Applications: Common in error detection and correction algorithms, including coding
theory, and for comparing binary sequences.
4. Mahalanobis Distance

o Description: Mahalanobis distance measures the distance between a point and a
distribution, taking the correlations of the data set into account. It is scale-invariant and
useful for identifying outliers.
o Applications: Used in multivariate anomaly detection, clustering, and classification tasks.
5. Chebyshev Distance

o Description: Chebyshev distance is the maximum absolute difference between any single
dimension of two data points. It is useful in scenarios where the largest deviation is of
interest.
o Applications: Used in certain quality-control processes and in applications where the
largest single difference is the most important factor.
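The following Python sketch, added for illustration, computes the Manhattan, Chebyshev, and Mahalanobis distances described above; the two points and the synthetic two-feature data set are invented for the example.

import numpy as np

p = np.array([2.0, 3.0])
q = np.array([5.0, 7.0])

manhattan = np.sum(np.abs(p - q))        # |2-5| + |3-7| = 7
chebyshev = np.max(np.abs(p - q))        # max(3, 4) = 4

# Mahalanobis distance of q from the mean of a synthetic two-feature data set,
# taking the correlation between the features into account.
rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[2.0, 1.2], [1.2, 1.0]], size=500)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = q - X.mean(axis=0)
mahalanobis = np.sqrt(diff @ cov_inv @ diff)

print(manhattan, chebyshev, round(float(mahalanobis), 3))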
Applications:
Dissimilarity measures provide a way to quantify the differences between data points and are
widely used in applications ranging from clustering and classification to anomaly detection and
bioinformatics. Several key applications are described below.
1. Clustering
In clustering, dissimilarity measures help to define the boundaries of clusters by quantifying how
different data points are from each other.
o K-Means Clustering: Uses Euclidean distance to assign data points to the closest cluster
centroid. Each data point is assigned to the cluster whose mean yields the least
within-cluster sum of squares.
o Hierarchical Clustering: Can use various distance metrics such as Euclidean, Manhattan,
or Chebyshev distances to build a hierarchy of clusters. The choice of distance metric can
significantly affect the shape and meaning of the resulting clusters.
2. Classification
Dissimilarity measures help in classification tasks by determining the difference between data
points, which is essential for assigning labels.
K-Nearest Neighbors (k-NN): Uses dissimilarity measures such as Euclidean distance to classify
a data point based on the labels of its nearest neighbours. The data point is assigned to the class
most common among its k nearest neighbours.
3. Anomaly Detection
Anomaly detection involves identifying data points that deviate markedly from the norm.
Dissimilarity measures help quantify those deviations.
o Mahalanobis Distance: Effective in multivariate anomaly detection because it considers
the correlations among variables. Points with a large Mahalanobis distance from the mean
are considered outliers.
o Euclidean and Chebyshev Distances: Used to identify outliers by measuring the distance
from the mean or other central points in the data.
4. Information Retrieval
In information retrieval, dissimilarity measures help rank items based on their differences from a
query, aiding the retrieval of the most relevant records.
o Euclidean Distance: Can be used to measure the difference between user preferences and
item features in recommendation systems, helping to suggest items that differ from those
the user has already seen.
o Hamming Distance: Used in text retrieval to measure the difference between binary or
categorical data, such as keywords or tags.
5. Image Processing
In image processing, dissimilarity measures compare image features, which is crucial for tasks
such as image retrieval and recognition.
o Euclidean Distance: Used in image retrieval systems to find images that are visually
different, based on feature vectors such as colour histograms or texture patterns.
o Hamming Distance: Employed in comparing binary image descriptors, such as those used
in fingerprint matching or optical character recognition.
6. Bioinformatics
In bioinformatics, dissimilarity measures are used to compare biological data such as genetic
sequences or protein structures, which is critical for understanding biological functions and
relationships.
o Hamming Distance: Used in sequence alignment to compare DNA, RNA, or protein
sequences, helping to identify mutations or evolutionary relationships.
o Euclidean and Mahalanobis Distances: Used to compare protein structures and other
high-dimensional biological data, aiding the study of molecular functions and interactions.
7. Quality Control
In manufacturing and quality control, dissimilarity measures are used to detect deviations from
the standard or expected product characteristics.
Chapter-2

2.1 Association Rule Mining:


Association rule mining is a popular and well-researched method for discovering
interesting relations between variables in large databases.
It is intended to identify strong rules discovered in databases using different measures of
interestingness.
Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules.
Problem Definition:
The problem of association rule mining is defined as:

Let I = {i1, i2, ..., in} be a set of n binary attributes called items.

Let D = {t1, t2, ..., tm} be a set of transactions called the database.

Each transaction t in D has a unique transaction ID and contains a subset of the items in I.
A rule is defined as an implication of the form X => Y,
where X, Y ⊆ I and X ∩ Y = ∅.
The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS)
and consequent (right-hand side or RHS) of the rule, respectively.
Example:
To illustrate the concepts, we use a small example from the supermarket domain. The set of
items is I = {milk, bread, butter, beer}, and a small database containing the items (1 codes
presence and 0 absence of an item in a transaction) is shown in the table below.

An example rule for the supermarket could be {butter, bread} => {milk}, meaning that if
butter and bread are bought, customers also buy milk.
Example database with 4 items and 5 transactions

Transaction ID   milk   bread   butter   beer
1                1      1       0        0
2                0      0       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0

2.1.1 Important concepts of Association Rule Mining:

The support supp(X) of an itemset X is defined as the proportion of transactions in the
data set which contain the itemset. In the example database, the itemset
{milk, bread, butter} has a support of 1/5 = 0.2, since it occurs in 20% of
all transactions (1 out of 5 transactions).

The confidence of a rule X => Y is defined as

conf(X => Y) = supp(X ∪ Y) / supp(X).

For example, the rule {butter, bread} => {milk} has a confidence of 0.2 / 0.2 = 1.0
in the database, which means that for 100% of the transactions
containing butter and bread, the rule is correct (100% of the times a customer buys butter
and bread, milk is bought as well). Confidence can be interpreted as an estimate of the

conditional probability P(Y | X), the probability of finding the RHS of the rule in transactions

under the condition that these transactions also contain the LHS.

The lift of a rule is defined as

lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),

or the ratio of the observed support to that expected if X and Y were independent. The

rule {milk, bread} => {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.

The conviction of a rule is defined as

conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y)).

The rule {milk, bread} => {butter} has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2,

and can be interpreted as the ratio of the expected frequency that X occurs without Y
(that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent, divided by the observed frequency of incorrect predictions.
These quantities can also be verified programmatically on the example database, as the short
sketch below shows.
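The Python sketch below is an added illustration (not part of the original notes); it recomputes the support, confidence, and lift values quoted above directly from the five-transaction example database.

transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # conf(X => Y) = supp(X union Y) / supp(X)
    return support(set(lhs) | set(rhs)) / support(lhs)

def lift(lhs, rhs):
    # lift(X => Y) = supp(X union Y) / (supp(X) * supp(Y))
    return support(set(lhs) | set(rhs)) / (support(lhs) * support(rhs))

print(support({"milk", "bread", "butter"}))          # 0.2
print(confidence({"butter", "bread"}, {"milk"}))     # 1.0
print(lift({"milk", "bread"}, {"butter"}))           # 0.2 / (0.4 * 0.4) = 1.25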

2.2 Market basket analysis:

This process analyzes customer buying habits by finding associations between the different items
that customers place in their shopping baskets. The discovery of such associations can help
retailers develop marketing strategies by gaining insight into which items are frequently
purchased together by customers. For instance, if customers are buying milk, how likely are they
to also buy bread (and what kind of bread) on the same trip to the supermarket? Such
information can lead to increased sales by helping retailers do selective marketing and plan their
shelf space.
Example:

If customers who purchase computers also tend to buy antivirus software at the same time, then
placing the hardware display close to the software display may help increase the sales of both
items. In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance,
after deciding on an expensive computer, a customer may observe security systems for sale while
heading toward the software display to purchase antivirus software and may decide to purchase a
home security system as well. Market basket analysis can also help retailers plan which items to
put on sale at reduced prices. If customers tend to purchase computers and printers together, then
having a sale on printers may encourage the sale of printers as well as computers.

2.3 Frequent Item set Generation:

Frequent pattern mining can be classified in various ways, based on the following criteria:
1. Based on the completeness of patterns to be mined:

We can mine the complete set of frequent itemsets, the closed frequent itemsets, and
the maximal frequent itemsets, given a minimum support threshold.

We can also mine constrained frequent itemsets, approximate frequent itemsets,


near-match frequent itemsets, top-k frequent itemsets, and so on.

2. Based on the levels of abstraction involved in the rule set:

Some methods for association rule mining can find rules at differing levels of abstraction.

For example, suppose that a set of association rules mined includes the following
rules where X is a variable representing a customer:

buys(X, "computer") => buys(X, "HP printer") (1)

buys(X, "laptop computer") => buys(X, "HP printer") (2)

In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g.,
"computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule:

If the items or attributes in an association rule reference only one dimension, then it is
a single-dimensional association rule.
buys(X, "computer") => buys(X, "antivirus software")

If a rule references two or more dimensions, such as the dimensions of age, income, and
buys, then it is a multidimensional association rule. The following rule is an example of a
multidimensional rule:
age(X, "30,31…39") ^ income(X, "42K,48K") => buys(X, "high resolution TV")
4. Based on the types of values handled in the rule:

If a rule involves associations between the presence or absence of items, it is a


Boolean association rule.

If a rule describes associations between quantitative items or attributes, then it is


a quantitative association rule.

5. Based on the kinds of rules to be mined:

Frequent pattern analysis can generate various kinds of rules and other interesting
relationships.

Association rule mining can generate a large number of rules, many of which are
redundant or do not indicate a correlation relationship among itemsets.

The discovered associations can be further analyzed to uncover statistical


correlations, leading to correlation rules.

6. Based on the kinds of patterns to be mined:


Many kinds of frequent patterns can be mined from different kinds of data sets.

Sequential pattern mining searches for frequent subsequences in a sequence data set,
where a sequence records an ordering of events.

For example, with sequential pattern mining, we can study the order in which items are
frequently purchased. For instance, customers may tend to first buy a PC, followed by a
digital camera, and then a memory card.

Structured pattern mining searches for frequent substructures in a structured data


set. Single items are the simplest form of structure.
Each element of an itemset may contain a subsequence, a subtree, and so on.

Therefore, structured pattern mining can be considered the most general form of frequent
pattern mining.
2.4 Efficient Frequent Itemset Mining Methods:

Finding Frequent Itemsets Using Candidate Generation: The Apriori Algorithm

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for


mining frequent itemsets for Boolean association rules.

The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties.

Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.

First, the set of frequent 1-item sets is found by scanning the database to accumulate the
count for each item and collecting those items that satisfy minimum support. The
resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets,
which is used to find L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
A two-step process is followed in Apriori consisting of join and prune action.
Example:

TID     List of item IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
There are nine transactions in this database, that is, |D| = 9.
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate1-
itemsets, C1. The algorithm simply scans all of the transactions to count the number of
occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of
frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets
satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum
support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to
generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune
step, because each subset of the candidates is also frequent.
4. Next, the transactions are scanned and the support count of each candidate item is
accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate2-
itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3: from the join step, we first get
C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on
the Apriori property that all subsets of a frequent itemset must also be frequent, we can
determine that the four latter candidates cannot be frequent, so they are removed in the prune step.

7. The transactions in D are scanned to determine L3, consisting of the candidate 3-itemsets in
C3 having minimum support.

8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
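The level-wise search described in steps 1 to 8 can be sketched in Python as follows. This is a simplified illustration, not the exact pseudocode of Agrawal and Srikant: transactions are assumed to be sets of item IDs, min_sup is an absolute support count, and the join step is approximated by enumerating k-item combinations followed by the Apriori prune.

from itertools import combinations

def apriori(transactions, min_sup):
    """Simplified level-wise Apriori: returns all frequent itemsets with their counts."""
    # L1: frequent 1-itemsets, found by one scan of the database
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(current)
    k = 2
    while current:
        # Join step (approximated): combine items of the frequent (k-1)-itemsets
        items = sorted({i for s in current for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)]
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = [c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))]
        # Scan the database to count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

# Example usage with the nine-transaction database above (min sup = 2):
D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
     {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
print(apriori(D, 2))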
2.4.2 Generating Association Rules from Frequent Itemsets:
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them.
Example:
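The worked example appears as a figure in the original notes. As a hedged sketch of the step itself (assuming the frequent itemsets and their support counts are already available, for example from the Apriori sketch above), rules can be generated by enumerating the nonempty proper subsets of each frequent itemset and keeping those whose confidence meets a minimum threshold:

from itertools import combinations

def generate_rules(frequent, min_conf):
    """For each frequent itemset l, emit rules s => (l - s) with confidence >= min_conf.
    `frequent` maps frozenset itemsets to their support counts; every nonempty subset
    of a frequent itemset is itself frequent, so its count is available in the map."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = count / frequent[lhs]     # supp(l) / supp(s)
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

# e.g. generate_rules(apriori(D, 2), min_conf=0.7)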

2.5 Mining Multilevel Association Rules:

For many applications, it is difficult to find strong associations among data items at
low or primitive levels of abstraction due to the sparsity of data at those levels.

Strong associations discovered at high levels of abstraction may represent


commonsense knowledge.

Therefore, data mining systems should provide capabilities for mining association
rules at multiple levels of abstraction, with sufficient flexibility for easy traversal
among different abstraction spaces.
Association rules generated from mining data at multiple levels of abstraction are called
multiple-level or multilevel association rules.

Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.

A top-down strategy is employed, where counts are accumulated to calculate frequent


itemsets at each concept level, starting at concept level 1 and working downward in the
hierarchy toward the more specific concept levels, until no more frequent itemsets can be
found.

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-
level, more general concepts. Data can be generalized by replacing low-level concepts within the
data with their higher-level concepts, or ancestors, from a concept hierarchy.
The concept hierarchy has five levels, respectively referred to as levels 0 to 4, starting with
level 0 at the root node for all.

• Level 1 includes a computer, software, printer & camera, and computer accessory.
• Level 2 includes laptop computers, desktop computers, office software, and antivirus
software
• Level 3 includes IBM desktop computers, Microsoft Office software, and so on.
• Level 4 is the most specific abstraction level of this hierarchy.

Approaches For Mining Multilevel Association Rules:

1. Uniform Minimum Support:


The same minimum support threshold is used when mining at each level of abstraction.
When a uniform minimum support threshold is used, the search procedure is simplified.
The method is also simple in that users are required to specify only one minimum support
threshold.
The uniform support approach, however, has some difficulties. It is unlikely that items at
lower levels of abstraction will occur as frequently as those at higher levels of
abstraction.
If the minimum support threshold is set too high, it could miss some meaningful
associations occurring at low abstraction levels. If the threshold is set too low, it may
generate many uninteresting associations occurring at high abstraction levels.
2. Reduced Minimum Support:
Each level of abstraction has its own minimum support threshold.
The deeper the level of abstraction, the smaller the corresponding threshold is.
For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%,
respectively. In this way, "computers," "laptop computers," and "desktop computer" are all
considered frequent.

3. Group-Based Minimum Support:


Because users or experts often have insight as to which groups are more important than
others, it is sometimes more desirable to set up user-specific, item, or group based minimal
support thresholds when mining multilevel rules.
For example, a user could set up the minimum support thresholds based on product price, or
on items of interest, such as by setting particularly low support thresholds for laptop
computers and flash drives in order to pay particular attention to the association patterns
containing items in these categories.

2.6 Mining Multidimensional Association Rules from Relational Databases and


Data Warehouses:

Single-dimensional or intra-dimensional association rule contains a single distinct


predicate (e.g., buys) with multiple occurrences i.e., the predicate occurs more than once
within the rule.

buys(X, "digital camera") => buys(X, "HP printer")

Association rules that involve two or more dimensions or predicates can be referred
to as multidimensional association rules.
Age (X, “20…29”) ^occupation (X, “student”) =>buys (X, “laptop”)

The above Rule contains three predicates (age, occupation, and buys), each of which
occurs only once in the rule. Hence, we say that it has no repeated predicates.

Multidimensional association rules with no repeated predicates are called


interdimensional association rules.

We can also mine multidimensional association rules with repeated predicates, which
contain multiple occurrences of some predicates. These rules are called hybrid-
dimensional association rules. An example of such a rule is the following, where the
predicate buys is repeated:
age(X, "20…29") ^ buys(X, "laptop") => buys(X, "HP printer")

2.7 Mining Quantitative Association Rules:


Quantitative association rules are multidimensional association rules in which the numeric
attributes are dynamically discretized during the mining process to satisfy some mining
criteria, such as maximizing the confidence or compactness of the rules mined.
In this section, we focus specifically on how to mine quantitative association rules having
two quantitative attributes on the left-hand side of the rule and one categorical attribute on
the right-hand side of the rule. That is
Aquan1 ^ Aquan2 => Acat
where Aquan1 and Aquan2 are tests on quantitative attribute intervals, and
Acat tests a categorical attribute from the task-relevant data.
Such rules have been referred to as two-dimensional quantitative association rules
because they contain two quantitative dimensions.
For instance, suppose you are curious about the association relationship between pairs
of quantitative attributes, like customer age and income, and the type of television (such
as high-definition TV, i.e., HDTV) that customers like to buy.
An example of such a 2-D quantitative association rule is
age(X, "30…39") ^ income(X, "42K…48K") => buys(X, "HDTV")
2.8 From Association Mining to Correlation Analysis:

A correlation measure can be used to augment the support-confidence framework


for association rules. This leads to correlation rules of the form
A=>B [support, confidence, correlation]

That is, a correlation rule is measured not only by its support and confidence but also by
the correlation between itemsets A and B. There are many different correlation measures
from which to choose. In this section, we study various correlation measures to determine
which would be good for mining large data sets.

Lift is a simple correlation measure that is given as follows. The occurrence of


itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B);
otherwise, itemsets A and B are dependent and correlated as events. This definition
can easily be extended to more than two itemsets.

The lift between the occurrence of A and B can be measured by computing

        lift(A, B) = P(A ∪ B) / (P(A) P(B))

If the lift(A,B) is less than 1, then the occurrence of A is negatively correlated with
the occurrence of B.
If the resulting value is greater than 1, then A and B are positively correlated, meaning that
the occurrence of one implies the occurrence of the other.
If the resulting value is equal to 1, then A and B are independent and there is no correlation
between them.
Chapter-3

3.1 Classification and Prediction:

Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends.
Classification predicts categorical (discrete, unordered) class labels, whereas prediction models
continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as
either safe or risky, or a prediction model to predict the expenditures of potential
customers on computer equipment given their income and occupation.
A predictor is constructed that predicts a continuous-valued function, or ordered value, as
opposed to a categorical label.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.
Many classification and prediction methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.
Most algorithms are memory resident, typically assuming a small data size. Recent data
mining research has built on such work, developing scalable classification and prediction
techniques capable of handling large disk-resident data.

3.1.1 Issues Regarding Classification and Prediction:

1. Preparing the Data for Classification and Prediction:


The following preprocessing steps may be applied to the data to help improve the accuracy,
efficiency, and scalability of the classification or prediction process.
(i) Data cleaning:
This refers to the preprocessing of data to remove or reduce noise (by applying smoothing
techniques) and the treatment of missing values (e.g., by replacing a missing value with the
most commonly occurring value for that attribute, or with the most probable value based on
statistics).
Although most classification algorithms have some mechanisms for handling noisy or
missing data, this step can help reduce confusion during learning.
(ii) Relevance analysis:
Many of the attributes in the data may be redundant.
Correlation analysis can be used to identify whether any two given attributes are
statistically related.
For example, a strong correlation between attributes A1 and A2 would suggest that one of
the two could be removed from further analysis.
A database may also contain irrelevant attributes. Attribute subset selection can be used
in these cases to find a reduced set of attributes such that the resulting probability
distribution of the data classes is as close as possible to the original distribution obtained
using all attributes.
Hence, relevance analysis, in the form of correlation analysis and attribute subset
selection, can be used to detect attributes that do not contribute to the classification or
prediction task.
Such analysis can help improve classification efficiency and scalability.
(iii) Data Transformation and Reduction
The data may be transformed by normalization, particularly when neural networks or
methods involving distance measurements are used in the learning step.
Normalization involves scaling all values for a given attribute so that they fall within a
small specified range, such as -1 to +1 or 0 to 1.
The data can also be transformed by generalizing it to higher-level concepts. Concept
hierarchies may be used for this purpose. This is particularly useful for
continuous-valued attributes.
For example, numeric values for the attribute income can be generalized to discrete
ranges, such as low, medium, and high. Similarly, categorical attributes, like streets, can
be generalized to higher-level concepts, like cities.
Data can also be reduced by applying many other methods, ranging from wavelet
transformation and principal components analysis to discretization techniques, such
as binning, histogram analysis, and clustering.

3.1.2 Comparing Classification and Prediction Methods:


 Accuracy:

The accuracy of a classifier refers to the ability of a given classifier to correctly predict
the class label of new or previously unseen data (i.e., tuples without class label
information).

The accuracy of a predictor refers to how well a given predictor can guess the value of
the predicted attribute for new or previously unseen data.
 Speed:
This refers to the computational costs involved in generating and using
the given classifier or predictor.
 Robustness:
This is the ability of the classifier or predictor to make correct
predictions given noisy data or data with missing values.
 Scalability:
This refers to the ability to construct the classifier or predictor efficiently
given large amounts of data.
 Interpretability:

This refers to the level of understanding and insight that is provided by the classifier or
predictor.
Interpretability is subjective and therefore more difficult to assess.
3.2 Classification by Decision Tree Induction:

Decision tree induction is the learning of decision trees from class-labeled training
tuples. A decision tree is a flowchart-like tree structure, where
 Each internal node denotes a test on an attribute.
 Each branch represents an outcome of the test.
 Each leaf node holds a class label.
 The topmost node in a tree is the root node.

The construction of decision tree classifiers does not require any domain knowledge or
parameter setting, and is therefore appropriate for exploratory knowledge discovery.

Decision trees can handle high-dimensional data.

Their representation of acquired knowledge in tree form is intuitive and generally easy to
assimilate by humans.

The learning and classification steps of decision tree induction are simple and
fast. In general, decision tree classifiers have good accuracy.
Decision tree induction algorithms have been used for classification in many application
areas, such as medicine, manufacturing and production, financial analysis, astronomy,
and molecular biology.
3.2.1 Algorithm For Decision Tree Induction:

The algorithm is called with three parameters:


 Data partition
 Attribute list
 Attribute selection method

The parameter attribute list is a list of attributes describing the tuples.

The attribute selection method specifies a heuristic procedure for selecting the attribute
that "best" discriminates the given tuples according to class.
The tree starts as a single node, N, representing the training tuples in D.
If the tuples in D are all of the same class, then node N becomes a leaf and is labeled
with that class.

All of the terminating conditions are explained at the end of the algorithm.

Otherwise, the algorithm calls the Attribute selection method to determine the
splitting criterion.

The splitting criterion tells us which attribute to test at node N by determining the "best"
way to separate or partition the tuples in D into individual classes.

There are three possible scenarios. Let A be the splitting attribute. A has v distinct values,
{a1, a2, … ,av}, based on the training data.

1 A is discrete-valued:

In this case, the outcomes of the test at node N correspond directly to the
known values of A.
A branch is created for each known value, aj, of A, and labeled with that
value. A need not be considered in any future partitioning of the tuples.

2 A is continuous-valued:

In this case, the test at node N has two possible outcomes, corresponding to the conditions
A <= split_point and A > split_point, respectively,
where split_point is the split point returned by the Attribute selection method as
part of the splitting criterion.

3 A is discrete-valued and a binary tree must be produced:

The test at node N is of the form "A ∈ SA?".


SA is the splitting subset for A, returned by the Attribute selection method as part of the
splitting criterion. It is a subset of the known values of A.
(Figure: the three partitioning scenarios: (a) A is discrete-valued; (b) A is continuous-valued; (c) A is discrete-valued and a binary tree must be produced.)
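To make the attribute selection step concrete, the sketch below computes an information-gain splitting criterion for discrete-valued attributes (the measure used by ID3-style induction). The tuple layout, a list of dictionaries with a "class" key, is an assumption made purely for illustration.

from collections import Counter
from math import log2

def entropy(tuples):
    """Expected information (entropy) of the class distribution in `tuples`."""
    counts = Counter(t["class"] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(tuples, attribute):
    """Gain obtained by partitioning `tuples` on a discrete-valued attribute."""
    total = len(tuples)
    remainder = 0.0
    for value in {t[attribute] for t in tuples}:
        subset = [t for t in tuples if t[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(tuples) - remainder

def best_split(tuples, attribute_list):
    """Attribute selection method: pick the attribute with the highest gain."""
    return max(attribute_list, key=lambda a: information_gain(tuples, a))

# Example usage (hypothetical training tuples):
D = [{"age": "youth", "income": "high", "class": "no"},
     {"age": "youth", "income": "low",  "class": "yes"},
     {"age": "senior", "income": "low", "class": "yes"}]
print(best_split(D, ["age", "income"]))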

3.3 Bayesian Classification:

Bayesian classifiers are statistical classifiers.

They can predict class membership probabilities, such as the probability that a given
tuple belongs to a particular class.

Bayesian classification is based on Bayes’ theorem.

3.3.1 Bayes’ Theorem:


Let X be a data tuple. In Bayesian terms, X is considered "evidence," and it is described by
measurements made on a set of n attributes.
Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that hypothesis H
holds given the "evidence" or observed data tuple X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes’ theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X):

        P(H|X) = P(X|H) P(H) / P(X)

3.3.2 Naïve Bayesian Classification:

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1.Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, …,xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, …, An.

2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned on X.

That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

        P(Ci|X) > P(Cj|X)   for 1 <= j <= m, j ≠ i

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called
the maximum posterior hypothesis. By Bayes’ theorem,

        P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3.As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci) P(Ci).
4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci). To reduce computation in evaluating P(X|Ci), the naive
assumption of class conditional independence is made. This presumes that the values of the
attributes are conditionally independent of one another, given the class label of the tuple.
Thus,

        P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci),.., P(xn|Ci) from the training
tuples. For each attribute, we look at whether the attribute is categorical or continuous-
valued. For instance, to compute P(X|Ci), we consider the following:
 If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value
xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
 If Ak is continuous-valued, then we need to do a bit more work, but the calculation is
pretty straightforward.
A continuous-valued attribute is typically assumed to have a Gaussian distribution with
mean μ and standard deviation σ, defined by

        g(x, μ, σ) = (1 / (√(2π) σ)) e^(-(x - μ)² / (2σ²)),   so that P(xk|Ci) = g(xk, μCi, σCi)

5. To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of tuple X is the class Ci if and only if

        P(X|Ci)P(Ci) > P(X|Cj)P(Cj)   for 1 <= j <= m, j ≠ i
3.4 A Multilayer Feed-Forward Neural Network:


The backpropagation algorithm performs learning on a multilayer feed-forward neural
network.
It iteratively learns a set of weights for the prediction of the class label of tuples.
A multilayer feed-forward neural network consists of an input layer, one or more

hidden layers, and an output layer.


Example:

The inputs to the network correspond to the attributes measured for each training tuple. The
inputs are fed simultaneously into the units making up the input layer. These inputs pass
through the input layer and are then weighted and fed simultaneously to a second layer
known as a hidden layer.
The outputs of the hidden layer units can be input to another hidden layer, and so on. The
number of hidden layers is arbitrary.
The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network’s prediction for given tuples
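A small sketch of the forward pass through such a network follows (one hidden layer, sigmoid units). The weights are random purely for illustration; backpropagation, which would learn them, is not shown.

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, biases):
    """Each unit j computes sigmoid(sum_i w[j][i] * x[i] + bias[j])."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def forward(x, hidden_w, hidden_b, output_w, output_b):
    """Feed the attribute vector x through the hidden layer, then the output layer."""
    hidden = layer_output(x, hidden_w, hidden_b)
    return layer_output(hidden, output_w, output_b)

# Example: 3 inputs -> 2 hidden units -> 1 output unit, with random weights.
random.seed(0)
hidden_w = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
hidden_b = [0.0, 0.0]
output_w = [[random.uniform(-1, 1) for _ in range(2)]]
output_b = [0.0]
print(forward([0.2, 0.7, 0.1], hidden_w, hidden_b, output_w, output_b))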

3.5 K-Nearest-Neighbor Classifier:

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing


a given test tuple with training tuples that are similar to it.
The training tuples are described by n attributes. Each tuple represents a point in an n-
dimensional space. In this way, all of the training tuples are stored in an n-dimensional
pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the
pattern space for the k-training tuples that are closest to the unknown tuple. These k-
training tuples are the k-nearest neighbors of the unknown tuple.
Closeness is defined in terms of a distance metric, such as Euclidean distance.
The Euclidean distance between two points or tuples, say, X1 = (x11, x12, …, x1n) and
X2 = (x21, x22, …, x2n), is

        dist(X1, X2) = sqrt( Σ_{i=1..n} (x1i - x2i)² )

In other words, for each numeric attribute, we take the difference between the corresponding
values of that attribute in tuple X1 and tuple X2, square this difference, and accumulate it.
The square root is taken from the total accumulated distance count.
Min-max normalization can be used to transform a value v of a numeric attribute A to
v' in the range [0, 1] by computing

        v' = (v - minA) / (maxA - minA)

where minA and maxA are the minimum and maximum values of attribute A.

For k-nearest-neighbor classification, the unknown tuple is assigned the most


common class among its k-nearest neighbors.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to
it in pattern space.
Nearest neighbor classifiers can also be used for prediction, that is, to return a real-
valued prediction for a given unknown tuple.
In this case, the classifier returns the average value of the real-valued labels
associated with the k nearest neighbors of the unknown tuple.
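The pieces above can be combined into a short sketch: min-max normalization, Euclidean distance, and a majority vote among the k nearest training tuples. The data layout is assumed for illustration.

from collections import Counter
from math import sqrt

def min_max_normalize(column):
    """Scale a list of numeric values into [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

def euclidean(x1, x2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train, unknown, k=3):
    """train: list of (attribute vector, class label); returns the majority class
    among the k training tuples closest to `unknown`."""
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], unknown))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Example usage (hypothetical, already-normalized tuples):
print(min_max_normalize([20000, 35000, 50000]))   # [0.0, 0.5, 1.0]
train = [([0.1, 0.2], "A"), ([0.15, 0.22], "A"), ([0.9, 0.85], "B"), ([0.8, 0.9], "B")]
print(knn_classify(train, [0.2, 0.25], k=3))      # "A"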

3.6 Classifier Accuracy:

The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier.
In the pattern recognition literature, this is also referred to as the overall recognition rate of
the classifier, that is, it reflects how well the classifier recognizes tuples of the various
classes.

The error rate or misclassification rate of a classifier, M, is simply 1-Acc(M),


where Acc(M) is the accuracy of M.

The confusion matrix is a useful tool for analyzing how well your classifier can recognize
tuples of different classes.

True positives refer to the positive tuples that were correctly labeled by the
classifier. True negatives are the negative tuples that were correctly labeled by the
classifier.
False positives are the negative tuples that were incorrectly labeled.

To assess how well the classifier recognizes tuples of each class, the sensitivity and
specificity measures can be used.
Accuracy is a function of sensitivity and specificity:

        sensitivity = t_pos / pos
        specificity = t_neg / neg
        precision   = t_pos / (t_pos + f_pos)
        accuracy    = sensitivity · pos / (pos + neg) + specificity · neg / (pos + neg)

where t_pos is the number of true positives, pos is the number of positive tuples,
t_neg is the number of true negatives, neg is the number of negative tuples,
and f_pos is the number of false positives.
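A short sketch of these measures computed from the four confusion-matrix counts; the variable names mirror the formulas above.

def evaluation_measures(t_pos, f_neg, t_neg, f_pos):
    """Sensitivity, specificity, precision and accuracy from confusion-matrix counts."""
    pos = t_pos + f_neg          # actual positive tuples
    neg = t_neg + f_pos          # actual negative tuples
    sensitivity = t_pos / pos
    specificity = t_neg / neg
    precision = t_pos / (t_pos + f_pos)
    accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return sensitivity, specificity, precision, accuracy

# Example: 90 true positives, 10 false negatives, 80 true negatives, 20 false positives
print(evaluation_measures(90, 10, 80, 20))   # (0.9, 0.8, ~0.818, 0.85)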
Chapter-4

4.1 Cluster Analysis:

The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.

A cluster is a collection of data objects that are similar to one another within the
same cluster and are dissimilar to the objects in other clusters.

A cluster of data objects can be treated collectively as one group and so may be considered
as a form of data compression.

Cluster analysis tools based on k-means, k-medoids, and several methods have also been
built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and
SAS.

4.1.1 Applications:
Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing.

In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.

In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.

Clustering may also help in the identification of areas of similar land use in an earth
observation database and the identification of groups of houses in a city according to house
type, value, and geographic location, as well as the identification of groups of automobile
insurance policyholders with a high average claim cost.

Clustering is also called data segmentation in some applications because


clustering partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection, Applications of outlier detection include
the detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.

4.1.2 Typical Requirements of Clustering in Data Mining:


 Scalability:
Many clustering algorithms work well on small data sets containing fewer than several
hundred data objects; however, a large database may contain millions of objects. Clustering
on a sample of a given large data set may lead to biased results.
Highly scalable clustering algorithms are needed.
 Ability to deal with different types of attributes:
Many algorithms are designed to cluster interval-based (numerical) data. However,
applications may require clustering other types of data, such as binary, categorical
(nominal), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape:
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance
measures. Algorithms based on such distance measures tend to find spherical clusters with
similar size and density.
However, a cluster could be of any shape. It is important to develop algorithms that can
detect clusters of arbitrary shape.
 Minimal requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to input certain parameters in cluster analysis
(such as the number of desired clusters). The clustering results can be quite sensitive to
input parameters. Parameters are often difficult to determine, especially for data sets
containing high-dimensional objects. This not only burdens users, but also makes the
quality of clustering difficult to control.
 Ability to deal with noisy data:
Most real-world databases contain outliers or missing, unknown, or erroneous data.
Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.
 Incremental clustering and insensitivity to the order of input records:
Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates)
into existing clustering structures and, instead, must determine a new clustering from
scratch. Some clustering algorithms are sensitive to the order of input data.
That is, given a set of data objects, such an algorithm may return dramatically different
clusterings depending on the order of presentation of the input objects.
It is important to develop incremental clustering algorithms and algorithms that are
insensitive to the order of input.
 High dimensionality:
A database or a data warehouse can contain several dimensions or attributes. Many
clustering algorithms are good at handling low-dimensional data, involving only two to
three dimensions. Human eyes are good at judging the quality of clustering for up to three
dimensions. Finding clusters of data objects in high dimensional space is challenging,
especially considering that such data can be sparse and highly skewed.
 Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of constraints.
Suppose that your job is to choose the locations for a given number of new automatic
banking machines (ATMs) in a city. To decide upon this, you may cluster households
while considering constraints such as the city’s rivers and highway networks, and the type
and number of customers per cluster. A challenging task is to find groups of data with good
clustering behavior that satisfy specified constraints.
 Interpretability and usability:
Users expect clustering results to be interpretable, comprehensible, and usable. That is,
clustering may need to be tied to specific semantic interpretations and applications. It is
important to study how an application goal may influence the selection of clustering
features and methods.

4.2 Major Clustering Methods:


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods

4.2.1 Partitioning Methods:


A partitioning method constructs k partitions of the data, where each partition represents a
cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the
following requirements:

 Each group must contain at least one object, and
 Each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation


technique that attempts to improve the partitioning by moving objects from one group to
another.

The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.

4.2.2 Hierarchical Methods:


A hierarchical method creates a hierarchical decomposition of the given set of data objects.
A hierarchical method can be classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed.

 The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group. It successively merges the objects or groups that are close to
one another, until all of the groups are merged into one or until a termination condition
holds.
 The divisive approach, also called the top-down approach, starts with all of the objects in
the same cluster. In each successive iteration, a cluster is split into smaller clusters,
until eventually each object is in its own cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can
never be undone. This rigidity is useful in that it leads to smaller computation costs by not
having to worry about a combinatorial number of different choices.

There are two approaches to improving the quality of hierarchical clustering:

 Perform careful analysis of object "linkages" at each hierarchical partitioning, such as
in Chameleon, or
 Integrate hierarchical agglomeration and other approaches by first using a hierarchical
agglomerative algorithm to group objects into microclusters, and then performing macroclustering
on the microclusters using another clustering method such as iterative relocation.
4.2.3 Density-based methods:
 Most partitioning methods cluster objects based on the distance between objects. Such
methods can find only spherical-shaped clusters and encounter difficulty in discovering
clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density. Their
general idea is to continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold; that is, for each data point within a given
cluster, the neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise (outliers)and discover clusters of
arbitrary shape.
 DBSCAN and its extension, OPTICS, are typical density-based methods that grow
clusters according to a density-based connectivity analysis. DENCLUE is a method
that clusters objects based on the analysis of the value distributions of density
functions.
4.2.4 Grid-Based Methods:
 Grid-based methods quantize the object space into a finite number of cells that form a
grid structure.
 All of the clustering operations are performed on the grid structure i.e., on the quantized
space. The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number
of cells in each dimension in the quantized space.
 STING is a typical example of a grid-based method. Wave Cluster applies wavelet
transformation for clustering analysis and is both grid-based and density-based.

4.2.5 Model-Based Methods:


 Model-based methods hypothesize a model for each of the clusters and find the best fit
of the data to the given model.
 A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
 It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking "noise" or outliers into account and thus yielding robust
clustering methods.

4.3 Classical Partitioning Methods:


The most well-known and commonly used partitioning methods are
 The k-Means Method
 k-Medoids Method
4.3.1 Centroid-Based Technique: The K-Means Method:
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k
clusters so that the resulting intracluster similarity is high but the intercluster similarity is
low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which
can be viewed as the cluster’s centroid or center of gravity.
The k-means algorithm proceeds as follows.
First, it randomly selects k of the objects, each of which initially represents a
cluster mean or center.

For each of the remaining objects, an object is assigned to the cluster to which it is
the most similar, based on the distance between the object and the cluster mean.

It then computes the new mean for each cluster.


This process iterates until the criterion function converges.

Typically, the square-error criterion is used, defined as

        E = Σ_{i=1..k} Σ_{p ∈ Ci} |p - mi|²

where E is the sum of the square error for all objects in the data set,
p is the point in space representing a given object, and
mi is the mean of cluster Ci.

4.3.2 The k-means partitioning algorithm:


The k-means algorithm for partitioning, where each cluster’s center is represented by the mean
value of the objects in the cluster.
(Figure: clustering of a set of objects based on the k-means method.)
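The loop described above can be sketched for 2-D points as follows (random initial centers and Euclidean distance); this is a simplified illustration rather than a production implementation.

import random
from math import dist   # Euclidean distance between two points (Python 3.8+)

def k_means(points, k, iterations=100):
    """Partition `points` (tuples of coordinates) into k clusters around mean centers."""
    centers = random.sample(points, k)                 # randomly pick k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                               # assign each object to the nearest center
            clusters[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
        new_centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]   # recompute each cluster mean
        if new_centers == centers:                     # stop when the means no longer change
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = k_means(points, k=2)
print(centers)
print(clusters)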

4.4 Hierarchical Clustering Methods:

A hierarchical clustering method works by grouping data objects into a tree of clusters.

The quality of a pure hierarchical clustering method suffers from its inability to perform
adjustment once a merge or split decision has been executed. That is, if a particular
merge or split decision later turns out to have been a poor choice, the method cannot
backtrack and correct it.

Hierarchical clustering methods can be further classified as either agglomerative or divisive,


depending on whether the hierarchical decomposition is formed in a bottom-up or top-down
fashion.

4.4.1 Agglomerative Hierarchical Clustering:


This bottom-up strategy starts by placing each object in its own cluster and then merges
these atomic clusters into larger and larger clusters until all of the objects are in a single
cluster or until certain termination conditions are satisfied.

Most hierarchical clustering methods belong to this category. They differ only in their
definition of inter cluster similarity.
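As a compact sketch of the bottom-up process, the code below uses single-linkage (minimum distance) as the inter-cluster similarity, which is only one of the possible definitions mentioned above.

from math import dist

def single_link(c1, c2):
    """Single linkage: distance between the two closest members of the two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerative(points, num_clusters):
    """Start with one cluster per object and repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(((i, j) for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))   # merge cluster j into cluster i
    return clusters

points = [(1, 1), (1.2, 1.1), (5, 5), (5.1, 4.9), (9, 1)]
print(agglomerative(points, num_clusters=3))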

4.4.2 Divisive hierarchical clustering:


This top-down strategy does the reverse of agglomerative hierarchical clustering by
starting with all objects in one cluster.
It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster
on its own or until it satisfies certain termination conditions, such as a desired number of
clusters is obtained or the diameter of each cluster is within a certain threshold.
4.5 Constraint-Based Cluster Analysis:

Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints.


Depending on the nature of the constraints, constraint-based clustering may adopt rather different
approaches.
There are a few categories of constraints.
 Constraints on individual objects:

We can specify constraints on the objects to be clustered. In a real estate application, for
example, one may like to spatially cluster only those luxury mansions worth over a million
dollars. This constraint confines the set of objects to be clustered. It can easily be
handled by preprocessing after which the problem reduces to an instance of unconstrained
clustering.

 Constraints on the selection of clustering parameters:

A user may like to set a desired range for each clustering parameter. Clustering parameters
are usually quite specific to the given clustering algorithm. Examples of parameters include
k, the desired number of clusters in a k-means algorithm; or the radius and the minimum
number of points in the DBSCAN algorithm. Although such user-specified parameters may
strongly influence the clustering results, they are usually confined to the algorithm itself.
Thus, their fine-tuning and processing are usually not considered a form of constraint-based
clustering.
 Constraints on distance or similarity functions:

We can specify different distance or similarity functions for specific attributes of the
objects to be clustered, or different distance measures for specific pairs of objects. When
clustering sportsmen, for example, we may use different weighting schemes for height,
body weight, age, and skill level. Although this will likely change the mining results, it
may not alter the clustering process per se. However, in some cases, such changes may
make the evaluation of the distance function nontrivial, especially when it is tightly
intertwined with the clustering process.
 User-specified constraints on the properties of individual clusters:
A user may like to specify the desired characteristics of the resulting clusters, which may
strongly influence the clustering process.
 Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak
form of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects
labeled as belonging to the same or different cluster). Such a constrained clustering process
is called semi-supervised clustering.
4.6 Outlier Analysis:

There exist data objects that do not comply with the general behavior or model of the data.
Such data objects, which are grossly different from or inconsistent with the remaining set
of data, are called outliers.

Many data mining algorithms try to minimize the influence of outliers or eliminate them.
This, however, could result in the loss of important hidden information because one
person’s noise could be another person’s signal. In other words, the outliers may be of
particular interest, such as in the case of fraud detection, where outliers may indicate
fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task,
referred to as outlier mining.

It can be used in fraud detection, for example, by detecting unusual usage of credit cards or
telecommunication services. In addition, it is useful in customized marketing for
identifying the spending behavior of customers with extremely low or extremely high
incomes, or in medical analysis for finding unusual responses to various medical
treatments.

Outlier mining can be described as follows: Given a set of n data points or objects and k, the
expected number of outliers, find the top k objects that are considerably dissimilar,
exceptional, or inconsistent with respect to the remaining data. The outlier mining problem can
be viewed as two subproblems:

 Define what data can be considered as inconsistent in a given data set, and
 Find an efficient method to mine the outliers so defined.
Types of outlier detection:
 Statistical Distribution-Based Outlier Detection
 Distance-Based Outlier Detection
 Density-Based Local Outlier Detection
 Deviation-Based Outlier Detection

4.6.1 Statistical Distribution-Based Outlier Detection:


The statistical distribution-based approach to outlier detection assumes a distribution or
probability model for the given data set (e.g., a normal or Poisson distribution) and then
identifies outliers with respect to the model using a discordancy test. Application of the test
requires knowledge of the data set parameters (such as the assumed data distribution),
knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.
A statistical discordancy test examines two hypotheses:
 A working hypothesis
 An alternative hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from
an initial distribution model, F, that is,

        H : oi ∈ F, where i = 1, 2, …, n

The hypothesis is retained if there is no statistically significant evidence supporting its


rejection. A discordancy test verifies whether an object, oi, is significantly large (or
small) with respect to the distribution F. Different test statistics have been proposed for use as a
discordancy test, depending on the available knowledge of the data. Assuming that some
statistic, T, has been chosen for discordancy testing, and the value of the statistic for
object oi is vi, then the distribution of T is constructed. Significance probability,
SP(vi)=Prob(T > vi), is evaluated. If SP(vi) is sufficiently small, then oi is discordant and
the working hypothesis is rejected.
An alternative hypothesis, H, which states that oi comes from another distribution model,
G, is adopted. The result is very much dependent on which model F is chosen because
oi may be an outlier under one model and a perfectly valid value under another. The
alternative distribution is very important in determining the power of the test, that is, the
probability that the working hypothesis is rejected when oi is really an outlier.
There are different kinds of alternative distributions.
Inherent alternative distribution:
In this case, the working hypothesis that all of the objects come from distribution F is
rejected in favor of the alternative hypothesis that all of the objects arise from another
distribution, G:
H : oi ∈ G, where i = 1, 2, …, n
F and G may be different distributions or differ only in parameters of the same
distribution.
There are constraints on the form of the G distribution in that it must have the potential
to produce outliers. For example, it may have a different mean or dispersion, or a
longer tail.
Mixture alternative distribution:
The mixture alternative states that discordant values are not outliers in the F population,
but contaminants from some other population, G. In this case, the alternative hypothesis is

        H : oi ∈ (1 - λ)F + λG, where i = 1, 2, …, n
Slippage alternative distribution:


This alternative states that all of the objects (apart from some prescribed small number)
arise independently from the initial model, F, with its given parameters, whereas the
remaining objects are independent observations from a modified version of F in which
the parameters have been shifted.
There are two basic types of procedures for detecting outliers:
Block procedures:
In this case, either all of the suspect objects are treated as outliers or all of them are
accepted as consistent.
Consecutive procedures:
An example of such a procedure is the inside-out procedure. Its main idea is that the
object that is least likely to be an outlier is tested first. If it is found to be an outlier, then
all of the
more extreme values are also considered outliers; otherwise, the next most extreme object is
tested, and so on. This procedure tends to be more effective than block procedures.

4.6.2 Distance-Based Outlier Detection:


The notion of distance-based outliers was introduced to counter the main limitations imposed
by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with
parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the
objects in D lie at a distance greater than dmin from o. In other words, rather than relying on
statistical tests, we can think of distance-based outliers as those objects that do not have
enough neighbors, where neighbors are defined based on distance from the given object. In
comparison with statistical-based methods, distance-based outlier detection generalizes the
ideas behind discordancy testing for various standard distributions. Distance-based outlier
detection avoids the excessive computation that can be associated with fitting the observed
distribution into some standard distribution and in selecting discordancy tests.
For many discordancy tests, it can be shown that if an object, o, is an outlier according to the
given test, then o is also a DB(pct, d min)-outlier for some suitably defined pct and d min.
For example, if objects that lie three or more standard deviations from the mean
are considered to be outliers, assuming a normal distribution, then this definition
can be generalized by a DB(0.9988, 0.13σ)-outlier.
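The definition can be checked directly with a naive scan; the sketch below flags an object as a DB(pct, dmin)-outlier if at least the fraction pct of the remaining objects lie farther than dmin from it. This is an illustration only; the index-based, nested-loop, and cell-based algorithms described next are far more efficient.

from math import dist

def db_outliers(points, pct, dmin):
    """Return the objects o for which at least a fraction pct of the other objects
    lie at a distance greater than dmin from o."""
    outliers = []
    for o in points:
        far = sum(1 for p in points if p != o and dist(o, p) > dmin)
        if far / (len(points) - 1) >= pct:
            outliers.append(o)
    return outliers

points = [(1, 1), (1.1, 0.9), (0.9, 1.2), (1.05, 1.0), (8, 8)]
print(db_outliers(points, pct=0.95, dmin=3.0))   # [(8, 8)]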
Several efficient algorithms for mining distance-based outliers have been developed.
Index-based algorithm:
Given a data set, the index-based algorithm uses multidimensional indexing structures, such
as R-trees or k-d trees, to search for neighbors of each object o within a radius dmin around that
object. Let M be the maximum number of objects within the dmin-neighborhood of an
outlier. Therefore, once M + 1 neighbors of object o are found, it is clear that o is not an outlier.
This algorithm has a worst-case complexity of O(k·n²), where n is the number of objects in
the data set and k is the dimensionality. The index-based algorithm scales well as k increases.
However, this complexity evaluation takes only the search time into account, even though the
task of building an index in itself can be computationally intensive.
Nested-loop algorithm:
The nested-loop algorithm has the same computational complexity as the index-based
algorithm but avoids index structure construction and tries to minimize the number of I/Os. It
divides the memory buffer space into two halves and the data set into several logical blocks.
By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can
be achieved.
Cell-based algorithm:
To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-
resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number
of cells and k is the dimensionality.
In this method, the data space is partitioned into cells with a side length equal to dmin/(2√k).
Each cell has two layers surrounding it. The first layer is one cell thick, while the second is
⌈2√k - 1⌉ cells thick, rounded up to the closest integer. The algorithm counts outliers on a
cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three counts
—the number of objects in the cell, in the cell and the first layer together, and in the cell and
both layers together. Let’s refer to these counts as cell count, cell + 1 layer count, and cell + 2
layers count, respectively.

Let M be the maximum number of outliers that can exist in the neighborhood of an outlier.

An object, o, in the current cell is considered an outlier only if cell + 1 layer count is less
than or equal to M. If this condition does not hold, then all of the objects in the cell can
be removed from further investigation as they cannot be outliers.
If cell + 2 layers count is less than or equal to M, then all of the objects in the cell are
considered outliers. Otherwise, if this number is more than M, then some of the objects in
the cell may be outliers. To detect these outliers, object-by-object processing is used
where, for each object, o, in the cell, objects in the second layer of o are examined. For
objects in the cell, only those objects having no more than M points in their d min-
neighborhoods are outliers. The d min-neighborhood of an object consists of the
object’s cell, all of its first layer, and some of its second layer.
A variation of the algorithm is linear with respect to n and guarantees that no more than three
passes over the data set are required. It can be used for large disk-resident data sets, yet does
not scale well for high dimensions.

4.6.3 Density-Based Local Outlier Detection:


Statistical and distance-based outlier detection both depend on the overall or global
distribution of the given set of data points, D. However, data are usually not uniformly
distributed. These methods encounter difficulties when analyzing data with rather different
density distributions.
To define the local outlier factor of an object, we need to introduce the concepts of
k-distance, k-distance neighborhood, reachability distance, and local reachability density.
These are defined as follows:
The k-distance of an object p is the maximal distance that p gets from its k-nearest
neighbors. This distance is denoted as k-distance(p). It is defined as the distance,
d(p, o), between p and an object o ∈ D, such that for at least k objects, o' ∈ D, it holds that
d(p, o') <= d(p, o). That is, there are at least k objects that are as close as or closer to p than o,
and for at most k - 1 objects, o'' ∈ D, it holds that d(p, o'') < d(p, o).

That is, there are at most k - 1 objects that are closer to p than o. You may be wondering at this
point how k is determined. The LOF method links to density-based clustering in that it sets k
to the parameter MinPts, which specifies the minimum number of points for use in
identifying clusters based on density.
Here, MinPts (as k) is used to define the local neighborhood of an object, p.
The k-distance neighborhood of an object p is denoted N_k-distance(p)(p), or Nk(p) for short. By
setting k to MinPts, we get N_MinPts(p). It contains the MinPts-nearest neighbors of p. That is, it
contains every object whose distance is not greater than the MinPts-distance of p.
The reachability distance of an object p with respect to object o (where o is one of the
MinPts-nearest neighbors of p) is defined as

        reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}

Intuitively, if an object p is far away from o, then the reachability distance between the two is simply
their actual distance. However, if they are sufficiently close (i.e., where p is within the
MinPts-distance neighborhood of o), then the actual distance is replaced by the MinPts-
distance of o. This helps to significantly reduce the statistical fluctuations of d(p, o) for all of
the p close to o.
The higher the value of MinPts is, the more similar the reachability distance for objects
within the same neighborhood.
Intuitively, the local reachability density of p is the inverse of the average reachability
distance based on the MinPts-nearest neighbors of p. It is defined as

        lrd_MinPts(p) = |N_MinPts(p)| / Σ_{o ∈ N_MinPts(p)} reach_dist_MinPts(p, o)

The local outlier factor (LOF) of p captures the degree to which we call p an outlier.
It is defined as

        LOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|
It is the average of the ratio of the local reachability density of p and those of p’s
MinPts-nearest neighbors. It is easy to see that the lower p’s local reachability density
is, and the higher the local reachability density of p’s MinPts-nearest neighbors are,
the higher LOF(p) is.
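A compact sketch of these definitions (k-distance neighborhood, reachability distance, local reachability density, and LOF) for a small in-memory data set follows; it is a direct transcription of the formulas above rather than an optimized implementation.

from math import dist

def knn(points, p, k):
    """The k nearest neighbors of p (p itself excluded)."""
    return sorted((q for q in points if q != p), key=lambda q: dist(p, q))[:k]

def k_distance(points, p, k):
    """Distance from p to its k-th nearest neighbor (the MinPts-distance when k = MinPts)."""
    return dist(p, knn(points, p, k)[-1])

def reach_dist(points, p, o, k):
    """reach_dist_k(p, o) = max{k-distance(o), d(p, o)}."""
    return max(k_distance(points, o, k), dist(p, o))

def lrd(points, p, k):
    """Local reachability density: inverse of the average reachability distance of p."""
    neighbors = knn(points, p, k)
    return len(neighbors) / sum(reach_dist(points, p, o, k) for o in neighbors)

def lof(points, p, k):
    """Average ratio of the neighbors' lrd to p's own lrd; values well above 1 suggest an outlier."""
    neighbors = knn(points, p, k)
    return sum(lrd(points, o, k) for o in neighbors) / (len(neighbors) * lrd(points, p, k))

points = [(1, 1), (1.1, 1.0), (0.9, 1.1), (1.0, 0.9), (5, 5)]
print(lof(points, (5, 5), k=3))   # noticeably greater than 1: likely a local outlier
print(lof(points, (1, 1), k=3))   # close to 1: well inside its neighborhood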
4.6.4 Deviation-Based Outlier Detection:
Deviation-based outlier detection does not use statistical tests or distance-based measures to
identify exceptional objects. Instead, it identifies outliers by examining the main
characteristics of objects in a group. Objects that ―deviate‖ from this description are
considered outliers. Hence, in this approach, the term deviations is typically used to refer to
outliers. In this section, we study two techniques for deviation-based outlier detection. The
first sequentially compares objects in a set, while the second employs an OLAP data cube
approach.

Sequential Exception Technique:


The sequential exception technique simulates how humans can distinguish unusual objects
from among a series of supposedly like objects. It uses implicit redundancy of the data.
Given a data set, D, of n objects, it builds a sequence of subsets, {D1, D2, …, Dm}, of these
objects with 2 <= m <= n such that

        Dj-1 ⊂ Dj, where Dj ⊆ D

Dissimilarities are assessed between subsets in the sequence. The technique introduces the
following key terms.
Exception set:
This is the set of deviations or outliers. It is defined as the small subset of objects whose
removal results in the greatest reduction of dissimilarity in the residual set.
Dissimilarity function:
This function does not require a metric distance between the objects. It is any function that, if
given a set of objects, returns a low value if the objects are similar to one another. The
greater the dissimilarity among the objects, the higher the value returned by the function. The
dissimilarity of a subset is incrementally computed based on the subset before it in the
sequence. Given a subset of n numbers, {x1, …, xn}, a possible dissimilarity function is the
variance of the numbers in the set, that is,

        (1/n) Σ_{i=1..n} (xi - x̄)²

where x̄ is the mean of the n numbers in the set. For character strings, the dissimilarity function
may be in the form of a pattern string (e.g., containing wildcard characters) that is used to cover
all of the patterns seen so far. The dissimilarity increases when the pattern covering all of the
strings in Dj-1 does not cover any string in Dj that is not in Dj-1.
Cardinality function:
This is typically the count of the number of objects in a given set.
Smoothing factor:
This function is computed for each subset in the sequence. It assesses how much the
dissimilarity can be reduced by removing the subset from the original set of objects.
