CS-DM Module-2
Data Preprocessing:
* Data cleaning can be applied to remove noise and correct inconsistencies in the data.
* Data integration merges data from multiple sources into a coherent data store, such as a
data warehouse.
* Data reduction can reduce the data size by aggregating, eliminating redundant features, or
clustering, for instance. These techniques are not mutually exclusive; they may work
together.
Data that were inconsistent with other recorded data may have been deleted.
Missing data, particularly for tuples with missing values for some attributes, may need to
be inferred.
There may be technology limitations, such as limited buffer size for coordinating
synchronized data transfer and consumption.
Data cleaning routines work to "clean" the data by filling in missing values, smoothing
noisy data, identifying or removing outliers, and resolving inconsistencies.
Data integration is the process of integrating multiple databases, data cubes, or files. However, some
attributes representing a given concept may have different names in different databases, causing
inconsistencies and redundancies.
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
1. DATA CLEANING
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
Missing Values: Many tuples have no recorded value for one or more attributes, such as
customer income, so the missing values for these attributes must be filled in. The following
methods are useful for handling missing values:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification). This method is not very effective, unless the tuple contains
several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
2. Fill in the missing values manually: This approach is time-consuming and may not be
feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the
same constant, such as a label like "unknown" or -∞.
4. Use the attribute mean to fill in the missing value: For example, suppose that the average
income of customers is $56,000. Use this value to replace the missing value for income.
5. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in the data set, a decision tree is constructed to
predict the missing values for income.
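A minimal sketch of filling in missing values with pandas, assuming a hypothetical customer table with an income column; it illustrates the global-constant and attribute-mean strategies described above.

```python
import pandas as pd
import numpy as np

# Hypothetical customer data with missing values
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "income": [48000, np.nan, 56000, np.nan],
    "occupation": ["teacher", None, "engineer", "clerk"],
})

# Strategy 3: fill a categorical attribute with a global constant
df["occupation"] = df["occupation"].fillna("unknown")

# Strategy 4: fill a numeric attribute with the attribute mean
df["income"] = df["income"].fillna(df["income"].mean())

print(df)
```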
Noise Data
Noise is a random error or variance in a measured variable. Noise is removed using data smoothing
techniques.
Binning:
Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values
around it. The sorted values are distributed into a number of "buckets" or "bins". Because binning
methods consult the neighborhood of values, they perform local smoothing.
Sorted data for price (in dollars): 3, 7, 14, 19, 23, 24, 31, 33, 38
Bin 1: 3,7,14
Bin 2: 19,23,24
Bin 3: 31,33,38
In the above method, the data for price are first sorted and then partitioned into
equal-frequency bins of size 3.
Bin 1: 8,8,8
Bin 2: 22,22,22
Bin 3: 34,34,34
In the smoothing by bin means method, each value in a bin is replaced by the mean value of the
bin.
Bin 1: 3,3,14
Bin 2: 19,24,24
Bin 3: 31,31,38
In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins
may be equal-width, where the interval range of values in each bin is constant.
Example 2: Remove the noise in the following price data using smoothing techniques.
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8,9,15
Bin 2: 21,21,24,25
Bin 3: 26,28,29,34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
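A minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, assuming the sorted price data from Example 2.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = prices.reshape(3, 4)  # 3 equal-frequency bins of depth 4

# Smoothing by bin means: replace every value with its bin's (rounded) mean
by_means = np.repeat(np.rint(bins.mean(axis=1, keepdims=True)), 4, axis=1).astype(int)

# Smoothing by bin boundaries: replace every value with the closest bin boundary
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)   # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)  # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```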
Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the "best" line to fit two attributes (or variables), so that one attribute
can be used to predict the other. Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit to a multidimensional surface.
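A minimal sketch of smoothing by linear regression with NumPy, assuming two hypothetical numeric attributes x and y; the noisy y values are replaced by values predicted from the fitted line.

```python
import numpy as np

# Hypothetical data: attribute y is a noisy linear function of attribute x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit the "best" line y = slope * x + intercept (least squares)
slope, intercept = np.polyfit(x, y, deg=1)

# Smoothed values: predict y from x using the fitted line
y_smoothed = slope * x + intercept
print(slope, intercept)
print(y_smoothed)
```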
Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or
"clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.
Inconsistent Data: Inconsistencies may exist in the stored data. They can arise from errors
during data entry, violations of functional dependencies between attributes, and missing values. The
inconsistencies can be detected and corrected either manually or by knowledge engineering tools.
Data cleaning as a process
a) Discrepancy detection
b) Data transformations
a) Discrepancy detection: The first step in data cleaning is discrepancy detection. It uses
knowledge of the metadata and examines the following rules to detect discrepancies.
Unique rules - each value of the given attribute must be different from all other values for that attribute.
Consecutive rules - imply that there are no missing values between the lowest and highest values for the
attribute and that all values must also be unique.
Null rules - specify the use of blanks, question marks, special characters, or other strings that may indicate the null condition.
Data scrubbing tools - use simple domain knowledge (e.g., knowledge of postal addresses and
spell-checking) to detect errors and make corrections in the data.
Data auditing tools - analyze the data to discover rules and relationships, and detect data that
violate such conditions.
b) Data transformations: This is the second step in data cleaning as a process. After detecting
discrepancies, we need to define and apply (a series of) transformations to correct them. Data
transformation tools:
Data migration tools - allow simple transformations to be specified, such as replacing the
string "gender" with "sex".
ETL (Extraction/Transformation/Loading) tools - allow users to specify transformations
through a graphical user interface (GUI).
2. Data Integration: Data mining often requires data integration, the merging of data from
multiple data stores into a coherent data store, as in data warehousing. These sources may include
multiple databases, data cubes, or flat files.
b) Redundancy: Redundancy is another important issue. An attribute (such as annual revenue, for
instance) may be redundant if it can be "derived" from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis and covariance analysis.
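A minimal sketch of redundancy detection by correlation analysis, assuming two hypothetical numeric attributes; a correlation coefficient close to 1 (or -1) suggests one attribute can be derived from the other.

```python
import numpy as np

# Hypothetical attributes: annual_revenue is almost a fixed multiple of units_sold
units_sold = np.array([120, 150, 90, 200, 170], dtype=float)
annual_revenue = np.array([2400, 3010, 1790, 4005, 3395], dtype=float)

# Pearson correlation coefficient between the two attributes
r = np.corrcoef(units_sold, annual_revenue)[0, 1]
print(f"correlation = {r:.3f}")  # a value near 1 flags a likely redundant attribute
```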
c) Detection and Resolution of Data Value Conflicts: A third important issue in data
integration is the detection and resolution of data value conflicts. For example, for the
same real-world entity, attribute values from different sources may differ. This may be due
to differences in representation, scaling, or encoding. For instance, a weight attribute may
be stored in metric units in one system and British imperial units in another. For a hotel
chain, the price of rooms in different cities may involve not only different currencies but
also different services (such as free breakfast) and taxes. An attribute in one system may be
recorded at a lower level of abstraction than the "same" attribute in another. Careful
integration of the data from multiple sources can help to reduce and avoid redundancies
and inconsistencies in the resulting data set. This, in turn, can improve the accuracy and
speed of the subsequent mining process.
3. Data Transformation:
Data transformation changes the format, structure, or values of the data and converts
them into clean, usable data. Data may be transformed at two stages of the data pipeline for
data analytics projects. Organizations that use on-premises data warehouses generally use
an ETL (extract, transform, and load) process, in which data transformation is the middle
step. Today, most organizations use cloud-based data warehouses to scale compute and
storage resources with latency measured in seconds or minutes. The scalability of the cloud
platform lets organizations skip preload transformations and load raw data into the data
warehouse, then transform it at query time.
4. Data Reduction: Obtain a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results. Why data
reduction? A database or data warehouse may store terabytes of data, and complex data
analysis may take a very long time to run on the complete data set.
Data reduction strategies:
4.1 Data cube aggregation
4.2 Attribute subset selection
4.3 Numerosity reduction (e.g., fit data into models)
4.4 Dimensionality reduction - data compression
Data cube aggregation: For example, suppose the data consist of AllElectronics sales per
quarter for the years 2014 to 2017. You are, however, interested in the annual sales rather
than the total per quarter. Thus, the data can be aggregated so that the resulting data
summarize the total sales per year instead of per quarter.
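A minimal sketch of the quarterly-to-annual aggregation with pandas, using hypothetical sales figures.

```python
import pandas as pd

# Hypothetical quarterly sales (one row per quarter)
sales = pd.DataFrame({
    "year":    [2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224.0, 408.0, 350.0, 586.0, 230.0, 415.0, 360.0, 610.0],
})

# Roll up the time dimension: aggregate quarterly figures into annual totals
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```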
Attribute Subset Selection Attribute subset selection reduces the data set size by
removing irrelevant or redundant attributes (or dimensions). The goal of attribute
subset selection is to find a minimum set of attributes. It reduces the number of
attributes appearing in the discovered patterns, helping to make the patterns easier to
understand. For n attributes, there are 2^n possible subsets. An exhaustive search for
the optimal subset of attributes can be prohibitively expensive, especially as n and the
number of data classes increase. Therefore, heuristic methods that explore a reduced
search space are commonly used for attribute subset selection. These methods are
typically greedy in that, while searching through attribute space, they always make what
looks to be the best choice at the time.
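A minimal sketch of greedy (stepwise forward) attribute subset selection using scikit-learn; the data set and the choice of a decision tree evaluator are illustrative assumptions. At each step the attribute that most improves cross-validated accuracy is added.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # stand-in data set with 4 attributes
remaining = list(range(X.shape[1]))        # candidate attribute indices
selected = []                              # greedily chosen subset

for _ in range(2):                         # keep the 2 best attributes
    scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)     # best single addition at this step
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)
```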
The following additional techniques, discussed below, are also used for data reduction and preprocessing:
• Sampling
• Dimensionality reduction
• Feature creation
• Variable transformation
Data transformation operations, such as normalization and aggregation, are
additional data preprocessing procedures that contribute toward the success of the
mining process.
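A minimal sketch of min-max normalization, one common data transformation, assuming a hypothetical numeric attribute rescaled to the range [0, 1].

```python
import numpy as np

# Hypothetical attribute values (e.g., customer incomes)
values = np.array([48000.0, 56000.0, 73000.0, 39000.0, 61000.0])

# Min-max normalization: rescale to the interval [0, 1]
v_min, v_max = values.min(), values.max()
normalized = (values - v_min) / (v_max - v_min)
print(normalized)
```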
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
Aggregation:
It is the combining of two or more objects into a single object. Consider a data set
consisting of transactions (data objects) recording the daily sales of products in
various store locations (Minneapolis, Chicago, Paris, ...) for different days over the
course of a year.
One way to aggregate transactions for this data set is to replace all the transactions of
a single store with a single storewide transaction. This reduces the hundreds or thousands
of transactions that occur daily at a specific store to a single daily transaction, and the
number of data objects is reduced to the number of stores.
There are several motivations for aggregation. First, the smaller data sets resulting from
data reduction require less memory and processing time, and hence, aggregation may permit
the use of more expensive data mining algorithms.
A disadvantage of aggregation is the potential loss of interesting details. In the store example
aggregating over months loses information about which day of the week has the highest sales.
Sampling
Types of sampling:
Simple random sampling - There are two variations on random sampling (and other sampling
techniques as well):
- Sampling without replacement: as each item is selected, it is removed from the population.
- Sampling with replacement: objects are not removed from the population as they are
selected for the sample. In sampling with replacement, the same object can be picked more
than once.
Stratified sampling - Split the data into several partitions, then draw random samples from
each partition.
Progressive sampling: The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used. These approaches start with a small sample
and then increase the sample size until a sample of sufficient size has been obtained. While this
technique eliminates the need to determine the correct sample size initially, it requires that
there be a way to evaluate the sample to judge whether it is large enough.
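A minimal sketch of simple random sampling (with and without replacement) and stratified sampling with pandas, assuming a hypothetical table whose class column serves as the stratum.

```python
import pandas as pd

# Hypothetical data set with a class attribute used as the stratum
df = pd.DataFrame({
    "value": range(100),
    "cls":   ["A"] * 70 + ["B"] * 30,
})

without_repl = df.sample(n=10, replace=False, random_state=0)   # sampling without replacement
with_repl    = df.sample(n=10, replace=True,  random_state=0)   # sampling with replacement

# Stratified sampling: draw 10% from each class partition
stratified = df.groupby("cls", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))
print(stratified["cls"].value_counts())  # roughly preserves the A/B proportions
```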
Dimensionality Reduction: When dimensionality increases, data become increasingly sparse in the
space that they occupy, and definitions of density and distance between points, which are critical
for clustering and outlier detection, become less meaningful. This is known as the curse of
dimensionality.
Linear Algebra Techniques for Dimensionality Reduction Some of the most common approaches
for dimensionality reduction, particularly for continuous data, use techniques from linear
algebra to project the data from a high-dimensional space into a lower-dimensional space.
Principal Components Analysis (PCA) is a linear algebra technique for continuous attributes that
finds new attributes (principal components) that (1) are linear combinations of the original
attributes, (2) are orthogonal to each other, and (3) capture the maximum amount of variation in
the data. For example, the first two principal components capture as much of the variation in the
data as is possible with two orthogonal attributes that are linear combinations of the original
attributes. Singular Value Decomposition (SVD) is a linear algebra technique that is related to PCA
and is also commonly used for dimensionality reduction.
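A minimal sketch of PCA-based dimensionality reduction using scikit-learn, assuming a hypothetical continuous data set projected onto its first two principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # hypothetical 10-dimensional continuous data

pca = PCA(n_components=2)                 # keep the first two principal components
X_reduced = pca.fit_transform(X)          # project into the 2-dimensional space

print(X_reduced.shape)                    # (200, 2)
print(pca.explained_variance_ratio_)      # fraction of variation captured by each component
```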
Feature Subset Selection : Another way to reduce the dimensionality is to use only a subset
of the features.
Redundant features duplicate much or all of the information contained in one or more other
attributes. For example, the purchase price of a product and the amount of sales tax paid
contain much of the same information. Irrelevant features contain almost no useful information
for the data mining task at hand. For instance, students' ID numbers are irrelevant to the task of
predicting students' grade point averages. Redundant and irrelevant features can reduce
classification accuracy and the quality of the clusters that are found.
There are three standard approaches to feature selection: embedded, filter, and wrapper.
Embedded approaches: Feature selection occurs naturally as part of the data mining algorithm.
Specifically, during the operation of the data mining algorithm, the algorithm itself decides
which attributes to use and which to ignore. Algorithms for building decision tree classifiers
often operate in this manner.
Filter approaches: Features are selected before the data mining algorithm is run, using some
approach that is independent of the data mining task. For example, we might select sets of
attributes whose pairwise correlation is as low as possible.
Wrapper approaches: These methods use the target data mining algorithm as a black box to
find the best subset of attributes, in a way similar to that of the ideal algorithm approach, but
typically without enumerating all possible subsets.
Finally, once a subset of features has been selected, the results of the target data mining
algorithm on the selected subset should be validated. A straightforward evaluation approach is
to run the algorithm with the full set of features and compare the full results to results obtained
using the subset of features.
Hopefully, the subset of features will produce results that are better than or almost as good as
those produced when using all features.
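A minimal sketch of a filter approach, assuming hypothetical numeric attributes: features whose pairwise correlation with an already-kept feature exceeds a threshold are dropped before any mining algorithm is run.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=200)
df = pd.DataFrame({
    "f1": base,
    "f2": base * 2.0 + rng.normal(scale=0.01, size=200),  # nearly redundant with f1
    "f3": rng.normal(size=200),
})

corr = df.corr().abs()          # absolute pairwise correlations
keep = []
for col in df.columns:
    # keep the attribute only if it is not highly correlated with one already kept
    if all(corr.loc[col, k] < 0.9 for k in keep):
        keep.append(col)
print(keep)                     # ['f1', 'f3'] -- f2 is filtered out as redundant
```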
Feature Creation: It is frequently possible to create, from the original attributes, a new set of
attributes that captures the important information in a dataset much more effectively. Three
related methodologies for creating new attributes are described next: feature extraction, mapping
the data to a new space, and feature construction.
Feature Extraction: The creation of a new set of features from the original raw data is known as
feature extraction. Consider a set of photographs, where each photograph is to be classified
according to whether or not it contains a human face. The raw data is a set of pixels and, as such,
is not suitable for many types of classification algorithms.
Mapping the Data to a New Space A totally different view of the data can reveal important and
interesting features. Consider, for example, time series data, which often contains periodic
patterns. If there is only a single periodic pattern and not much noise, then the pattern is easily
detected. If, on the other hand, there are a number of periodic patterns and a significant amount of
noise is present, then these patterns are hard to detect. Such patterns can, nonetheless, often be
detected by applying a Fourier transform to the time series in order to change to a representation
in which frequency information is explicit
Example (Fourier Analysis): The time series presented in Figure (b) is the sum of three other
time series, two of which are shown in Figure (a) and have frequencies of 7 and 17 cycles per
second, respectively. The third time series is random noise. Figure (c) shows the power
spectrum that can be computed after applying a Fourier transform to the original time series.
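A minimal sketch that reproduces the idea of this example with NumPy: a noisy sum of 7 Hz and 17 Hz sinusoids is mapped to the frequency domain, where the two periodic patterns appear as peaks in the power spectrum. The sampling rate and noise level are illustrative assumptions.

```python
import numpy as np

fs = 200                                   # samples per second
t = np.arange(0, 2, 1 / fs)                # 2 seconds of data
rng = np.random.default_rng(0)

# Sum of a 7 Hz and a 17 Hz sinusoid plus random noise
x = (np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 17 * t)
     + rng.normal(scale=0.5, size=t.size))

power = np.abs(np.fft.rfft(x)) ** 2        # power spectrum
freqs = np.fft.rfftfreq(t.size, d=1 / fs)  # corresponding frequencies

top = freqs[np.argsort(power)[-2:]]        # the two strongest frequencies
print(sorted(top))                         # approximately [7.0, 17.0]
```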
Feature Construction: Sometimes the features in the original data sets have the necessary
information, but it is not in a form suitable for the data mining algorithm. In this situation, one
or more new features can be constructed out of the original features.
Example (Density). To illustrate this, consider a data set consisting of information about
historical artifacts, which, along with other information, contains the volume and mass of each
artifact. For simplicity, assume that these artifacts are made of a small number of materials
(wood, clay, bronze, gold) and that we want to classify the artifacts with respect to the material
of which they are made. In this case, a density feature constructed from the mass and volume
features, i.e., density = mass/volume, would most directly yield an accurate classification.
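A minimal sketch of this feature construction with pandas, using hypothetical mass and volume values for a few artifacts.

```python
import pandas as pd

# Hypothetical artifact measurements
artifacts = pd.DataFrame({
    "material":   ["wood", "clay", "bronze", "gold"],
    "mass_g":     [350.0, 900.0, 4400.0, 9650.0],
    "volume_cm3": [500.0, 500.0, 500.0, 500.0],
})

# Constructed feature: density separates the materials far better than mass or volume alone
artifacts["density_g_cm3"] = artifacts["mass_g"] / artifacts["volume_cm3"]
print(artifacts)
```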
Discretization:
Algorithms that find association patterns require that the data be in the form of binary
attributes. Thus, it is often necessary to transform a continuous attribute into a categorical
attribute (discretization)
Binarization: Both continuous and discrete attributes may need to be transformed into one or
more binary attributes (binarization).
A simple technique to binarize a categorical attribute is the following: If there are m categorical
values, then uniquely assign each original value to an integer in the interval [0, m - 1]. If the
attribute is ordinal, then order must be maintained by the assignment. Next, convert each of
these m integers to a binary number. Since n = ⌈log2(m)⌉ binary digits are required to represent
these integers, represent these binary numbers using n binary attributes.
To illustrate, a categorical variable with 5 values {awful, poor, OK, good, great} would require
three binary variables x1, x2, and x3. The conversion is shown in Table 2.5. Such a
transformation can cause complications, such as creating unintended relationships among the
transformed attributes. For example, in Table 2.5, attributes x2 and x3 are correlated because
information about the good value is encoded using both attributes. Furthermore, association
analysis requires asymmetric binary attributes, where only the presence of the attribute (value
= 1) is important. For association problems, it is therefore necessary to introduce one binary
attribute for each categorical value, as in Table 2.6.
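A minimal sketch of both binarization schemes, assuming the ordinal values above: a compact encoding with ⌈log2(m)⌉ binary attributes, and one asymmetric binary attribute per categorical value as needed for association analysis.

```python
import math
import pandas as pd

values = ["awful", "poor", "OK", "good", "great"]            # ordinal categorical values
ratings = pd.Series(["good", "awful", "great", "OK"])        # hypothetical data

# Compact encoding: map each value to an integer, then to ceil(log2(m)) binary attributes
codes = ratings.map({v: i for i, v in enumerate(values)}).to_numpy()
n_bits = math.ceil(math.log2(len(values)))                   # 3 bits for 5 values
compact = pd.DataFrame(
    {f"x{b + 1}": (codes >> (n_bits - 1 - b)) & 1 for b in range(n_bits)})

# Asymmetric encoding: one binary attribute per categorical value (as in Table 2.6)
one_per_value = pd.get_dummies(ratings).astype(int)

print(compact)
print(one_per_value)
```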
In the first step, after the values of the continuous attribute are sorted, they are then divided
into n intervals by specifying n - 1 split points.
In the second, rather trivial step, all the values in one interval are mapped to the same
categorical value. Therefore, the problem of discretization is one of deciding how many split
points to choose and where to place them.
Example: The results of applying equal-width, equal-frequency, and K-means discretization are
shown in Figures 2.13(b), 2.13(c), and 2.13(d), respectively. The split points are represented as
dashed lines. If we measure the performance of a discretization technique by the extent to
which different objects in different groups are assigned the same categorical value, then
K-means performs best, followed by equal frequency, and finally, equal width.
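A minimal sketch of equal-width and equal-frequency discretization with pandas, assuming a small hypothetical set of continuous values split into three intervals.

```python
import pandas as pd

values = pd.Series([2.0, 3.5, 4.1, 8.0, 9.5, 10.2, 18.0, 21.0, 25.0])

equal_width = pd.cut(values, bins=3, labels=["low", "mid", "high"])    # intervals of equal range
equal_freq  = pd.qcut(values, q=3, labels=["low", "mid", "high"])      # intervals of equal count

print(pd.DataFrame({"value": values,
                    "equal_width": equal_width,
                    "equal_freq": equal_freq}))
```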
Supervised Discretization: The discretization methods described above are usually better than
no discretization, but keeping the end purpose in mind and using additional information (class
labels) often produces better results. This should not be surprising, since an interval constructed
with no knowledge of class labels often contains a mixture of class labels. A conceptually simple
approach is to place the splits in a way that maximizes the purity of the intervals. In practice,
however, such an approach requires potentially arbitrary decisions about the purity of an
interval and the minimum size of an interval.
To overcome such concerns, some statistically based approaches start with each attribute value
as a separate interval and create larger intervals by merging adjacent intervals that are similar
according to a statistical test.
Categorical Attributes with Too Many Values: Categorical attributes can sometimes have too
many values. If the categorical attribute is an ordinal attribute, then techniques similar to those
for continuous attributes can be used to reduce the number of categories. If the categorical
attribute is nominal, however, then other approaches are needed. For example, a department
name attribute might have dozens of different values. In this situation, we could use our
knowledge of the relationships among different departments to combine departments into larger
groups, such as engineering, social sciences, or biological sciences.
Variable Transformation: A variable transformation refers to a transformation that is
applied to all the values of a variable. For example, if only the magnitude of a variable is
important, then the values of the variable can be transformed by taking the absolute value. Two
important types of variable transformations are simple functional transformations and
normalization.
Variable transformations should be applied with caution since they change the nature of the
data. While this is what is desired, there can be problems if the nature of the transformation is
not fully appreciated. For instance, the transformation 1/x reduces the magnitude of values that
are 1 or larger, but increases the magnitude of values between 0 and 1. To illustrate, the values
{1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}. Thus, for all sets of values, the
transformation 1/x reverses the order.
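A minimal sketch that demonstrates this order reversal numerically with NumPy (for positive values).

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0])   # increasing positive values
print(1.0 / x)                       # [2.0, 1.0, 0.5, 0.333...] -- the order is reversed
```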
Similarity and dissimilarity are important because they are used by a number of data mining
techniques, such as clustering, nearest neighbor classification, and anomaly detection. For
convenience, the term proximity is used to refer to either similarity or dissimilarity. The
proximity between two objects is a function of the proximity between the corresponding
attributes of the two objects. The similarity between two objects is a numerical measure of the
degree to which the two objects are alike.
Consequently, similarities are higher for pairs of objects that are more alike. Similarities are
usually non-negative and are often between 0 (no similarity) and 1 (complete similarity). The
dissimilarity between two objects is a numerical measure of the degree to which the two
objects are different. Dissimilarities are lower for more similar pairs of objects.
For objects with a single ordinal attribute, the situation is more complicated because
information about order should be taken into account. Consider an attribute that measures the
quality of a product, e.g., a candy bar, on the scale {poor, fair, OK, good, wonderful} . It would
seem reasonable that a product, P1, which is rated wonderful, would be closer to a product P2,
which is rated good, than it would be to a product P3, which is rated OK. To make this
observation quantitative, the values of the ordinal attribute are often mapped to successive
integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3, wonderful=4}. Then, d(P1, P2)
= 3 - 2 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (3 - 2)/4 = 0.25. A
similarity for ordinal attributes can then be defined as s = 1 - d.
Dissimilarities and Similarities between Data Objects:
If d(p, q) is the distance (dissimilarity) between two points (data objects) p and q, then a distance
measure typically satisfies the following properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q (positivity),
2. d(p, q) = d(q, p) for all p and q (symmetry),
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r (triangle inequality).
A distance that satisfies these properties is called a metric. The following is a list of several common
distance measures used to compare multivariate data. We will assume that the attributes are all continuous.
a) Euclidean Distance
Assume that we have measurements x_ik, i = 1, …, N, on variables k = 1, …, p (also called attributes).
The Euclidean distance between the ith and jth objects is
d_E(i, j) = ( Σ_{k=1}^{p} (x_ik - x_jk)^2 )^(1/2)
b) Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance:
d_M(i, j) = ( Σ_{k=1}^{p} |x_ik - x_jk|^λ )^(1/λ), where λ ≥ 1. It is also called the Lλ metric.
Note that λ and p are two different parameters: λ is the order of the distance, while p is the number
of attributes (the dimension of the data matrix).
c) Mahalanobis Distance
Let X be an N×p data matrix. Then the ith row of X is
x_i^T = (x_i1, x_i2, …, x_ip).
The Mahalanobis distance between the ith and jth objects is
d_MH(i, j) = ( (x_i - x_j)^T S^(-1) (x_i - x_j) )^(1/2), where S is the sample covariance matrix of X.
This section provides specific examples of some similarity and dissimilarity measures.
Similarity Measures for Binary Data: Let x and y be two objects that consist of n binary attributes.
The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient: One commonly used similarity coefficient is the simple matching
coefficient (SMC), which is defined as
SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
Jaccard Coefficient: Suppose that x and y are data objects that represent two rows (two
transactions) of a transaction matrix. The Jaccard coefficient, which is often symbolized by J, is
given by the following equation:
J = f11 / (f01 + f10 + f11)
Cosine Similarity: The cosine similarity, defined next, is one of the most common measures of
document similarity. If x and y are two document vectors, then
cos(x, y) = (x · y) / (||x|| ||y||), where · denotes the vector dot product and ||x|| is the length of vector x.
Extended Jaccard Coefficient:
The extended Jaccard coefficient can be used for document data and reduces to the
Jaccard coefficient in the case of binary attributes. The extended Jaccard coefficient is also
known as the Tanimoto coefficient and is defined as
EJ(x, y) = (x · y) / (||x||^2 + ||y||^2 - x · y)
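A minimal sketch computing the simple matching coefficient, the Jaccard coefficient, and the cosine similarity with NumPy, using two small hypothetical binary vectors and two hypothetical document term-count vectors.

```python
import numpy as np

# Hypothetical binary vectors (e.g., items bought by two customers)
x = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f00 + f01 + f10 + f11)
jaccard = f11 / (f01 + f10 + f11)
print(smc, jaccard)

# Hypothetical document vectors (term counts)
d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)
cosine = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cosine, 3))
```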
Correlation: The correlation between two data objects that have binary or continuous variables
is a measure of the linear relationship between the attributes of the objects. Pearson's
correlation coefficient between two data objects, x and y, is defined by the following equation:
corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = s_xy / (s_x s_y)
Several important issues arise in proximity calculation:
(1) how to handle the case in which attributes have different scales and/or are correlated;
(2) how to calculate proximity between objects that are composed of different types of
attributes, e.g., quantitative and qualitative;
(3) how to handle proximity calculation when attributes have different weights, i.e., when
not all attributes contribute equally to the proximity of objects.
A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are
correlated, have different ranges of values (different variances), and the distribution of the data
is approximately Gaussian (normal). Specifically, the Mahalanobis distance between two objects
(vectors) x and y is defined as
mahalanobis(x, y) = ( (x - y)^T S^(-1) (x - y) )^(1/2), where S is the covariance matrix of the data.
A general approach is needed when the attributes are of different types. One straightforward
approach is to compute the similarity between each attribute separately. Then combine these
similarities using a method that results in a similarity between 0 and 1. Typically, the overall
similarity is defined as the average of all the individual attribute similarities.
Using Weights:
In the preceding discussion, all attributes were treated equally. This is not desirable when some
attributes are more important to the definition of proximity than others. To address these
situations, the formulas for proximity can be modified by weighting the contribution of each attribute.
Common distance and similarity measures:
1. Euclidean Distance (L2 Norm):
Formula: d(x, y) = ( Σ_i (x_i - y_i)^2 )^(1/2)
Use Case: Suitable for continuous numerical data.
Considerations: Sensitive to outliers.
2. Manhattan Distance (L1 Norm):
Formula: d(x, y) = Σ_i |x_i - y_i|; in two dimensions, d = |x1 - x2| + |y1 - y2|.
Use Case: Suitable for sparse data and less sensitive to outliers than Euclidean distance.
3. Cosine Similarity:
Formula: cos(x, y) = (x · y) / (||x|| ||y||)
Use Case: Effective for text data, document similarity, and high-dimensional data.
Considerations: Ignores magnitude and focuses on direction.
4. Jaccard Similarity:
Formula: J(A, B) = |A ∩ B| / |A ∪ B|
Use Case: Suitable for binary or categorical data; often used in set comparisons.
5. Hamming Distance:
Formula: Number of positions at which the corresponding symbols differ.
Use Case: Applicable to binary or categorical data of the same length.
6. Minkowski Distance:
Formula: d(x, y) = ( Σ_i |x_i - y_i|^p )^(1/p)
Use Case: Generalization of Euclidean and Manhattan distances; the parameter p
determines the norm.
7. Correlation-based Measures:
Pearson Correlation Coefficient: Measures linear correlation.
Spearman Rank Correlation Coefficient: Measures monotonic relationships.
Use Case: Suitable for comparing the relationship between variables.
8. Mahalanobis Distance:
Formula: d(x, y) = ( (x - y)^T S^(-1) (x - y) )^(1/2), where S is the covariance matrix
Use Case: Effective when dealing with multivariate data with different scales.
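A minimal sketch computing several of these measures with NumPy and SciPy for two hypothetical numeric vectors (and two binary vectors for Jaccard and Hamming); the scipy.spatial.distance routines used are standard SciPy functions.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 5.0, 7.0, 6.0])

print("Euclidean:", distance.euclidean(x, y))
print("Manhattan:", distance.cityblock(x, y))
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Cosine similarity:", 1.0 - distance.cosine(x, y))  # SciPy returns the cosine *distance*

# Binary vectors for Jaccard similarity and Hamming distance
a = np.array([1, 1, 0, 1, 0, 0])
b = np.array([1, 0, 0, 1, 1, 0])
print("Jaccard similarity:", 1.0 - distance.jaccard(a, b))        # SciPy returns the dissimilarity
print("Hamming distance:", int(distance.hamming(a, b) * a.size))  # convert fraction to count

# Mahalanobis distance needs the inverse covariance matrix of the data set
data = np.random.default_rng(0).normal(size=(50, 4))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print("Mahalanobis:", distance.mahalanobis(x, y, VI))
```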