CS-DM Module-2
Data Preprocessing:
* Data cleaning can be applied to remove noise and correct inconsistencies in the data.
* Data integration merges data from multiple sources into a coherent data store, such as a
data warehouse.
* Data reduction can reduce the data size by aggregating, eliminating redundant features, or
clustering, for instance. These techniques are not mutually exclusive; they may work
together.
Data that were inconsistent with other recorded data may have been deleted.
Missing data, particularly for tuples with missing values for some attributes, may need to
be inferred.
There may be technology limitations, such as limited buffer size for coordinating
synchronized data transfer and consumption.
Data cleaning routines work to "clean" the data by filling in missing values, smoothing
noisy data, identifying or removing outliers, and resolving inconsistencies.
Data integration is the process of integrating multiple databases, data cubes, or files. However, some
attributes representing a given concept may have different names in different databases, causing
inconsistencies and redundancies.
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
1. DATA CLEANING
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
Missing Values: Many tuples have no recorded value for one or more attributes, such as
customer income, so the missing values for these attributes must be filled in. The following
methods are useful for handling missing values:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification). This method is not very effective, unless the tuple contains
several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
2. Fill in the missing values manually: This approach is time-consuming and may not be
feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the
same constant, such as a label like "unknown" or -∞.
4. Use the attribute mean to fill in the missing value: For example, suppose that the average
income of customers is $56,000. Use this value to replace the missing value for income.
5. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in the data set, a decision tree is constructed to
predict the missing values for income.
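A minimal sketch of filling in missing values with pandas, assuming a hypothetical customer table with an income column; it illustrates the global-constant and attribute-mean strategies described above.

```python
import pandas as pd
import numpy as np

# Hypothetical customer data with missing values
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "income": [48000, np.nan, 56000, np.nan],
    "occupation": ["teacher", None, "engineer", "clerk"],
})

# Strategy 3: fill a categorical attribute with a global constant
df["occupation"] = df["occupation"].fillna("unknown")

# Strategy 4: fill a numeric attribute with the attribute mean
df["income"] = df["income"].fillna(df["income"].mean())

print(df)
```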
Noise Data
Noise is a random error or variance in a measured variable. Noise is removed using data smoothing
techniques.
Binning:
Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values
around it. The sorted values are distributed into a number of "buckets" or "bins". Because binning
methods consult the neighborhood of values, they perform local smoothing.
Sorted data for price (in dollars): 3, 7, 14, 19, 23, 24, 31, 33, 38
Bin 1: 3,7,14
Bin 2: 19,23,24
Bin 3: 31,33,38
In the above method, the data for price are first sorted and then partitioned into
equal-frequency bins of size 3.
Bin 1: 8,8,8
Bin 2: 22,22,22
Bin 3: 34,34,34
In the smoothing by bin means method, each value in a bin is replaced by the mean value of the
bin.
Bin 1: 3,3,14
Bin 2: 19,24,24
Bin 3: 31,31,38
In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins
may be equal-width, where the interval range of values in each bin is constant.
Example 2: Remove the noise in the following price data using smoothing techniques.
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8,9,15
Bin 2: 21,21,24,25
Bin 3: 26,28,29,34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
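A minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, assuming the sorted price data from Example 2.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = prices.reshape(3, 4)  # 3 equal-frequency bins of depth 4

# Smoothing by bin means: replace every value with its bin's (rounded) mean
by_means = np.repeat(np.rint(bins.mean(axis=1, keepdims=True)), 4, axis=1).astype(int)

# Smoothing by bin boundaries: replace every value with the closest bin boundary
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)   # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)  # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```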
Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the "best" line to fit two attributes (or variables), so that one attribute
can be used to predict the other. Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit to a multidimensional surface.
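A minimal sketch of smoothing by linear regression with NumPy, assuming two hypothetical numeric attributes x and y; the noisy y values are replaced by values predicted from the fitted line.

```python
import numpy as np

# Hypothetical data: attribute y is a noisy linear function of attribute x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit the "best" line y = slope * x + intercept (least squares)
slope, intercept = np.polyfit(x, y, deg=1)

# Smoothed values: predict y from x using the fitted line
y_smoothed = slope * x + intercept
print(slope, intercept)
print(y_smoothed)
```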
Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or
"clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.
Inconsistent Data: Inconsistencies may exist in the stored data. They can arise from errors
during data entry, violations of functional dependencies between attributes, and missing values. The
inconsistencies can be detected and corrected either manually or by knowledge engineering tools.
Data cleaning as a process
a) Discrepancy detection
b) Data transformations
a) Discrepancy detection: The first step in data cleaning is discrepancy detection. It uses
knowledge of the metadata and examines the following rules to detect discrepancies.
Unique rules - each value of the given attribute must be different from all other values for that attribute.
Consecutive rules - imply that there are no missing values between the lowest and highest values for the
attribute and that all values must also be unique.
Null rules - specify the use of blanks, question marks, special characters, or other strings that may indicate the null condition.
Data scrubbing tools - use simple domain knowledge (e.g., knowledge of postal addresses and
spell-checking) to detect errors and make corrections in the data.
Data auditing tools - analyze the data to discover rules and relationships, and detect data that
violate such conditions.
b) Data transformations: This is the second step in data cleaning as a process. After detecting
discrepancies, we need to define and apply (a series of) transformations to correct them. Data
transformation tools:
Data migration tools - allow simple transformations to be specified, such as replacing the
string "gender" with "sex".
ETL (Extraction/Transformation/Loading) tools - allow users to specify transformations
through a graphical user interface (GUI).
2. Data Integration: Data mining often requires data integration, the merging of data from
multiple data stores into a coherent data store, as in data warehousing. These sources may include
multiple databases, data cubes, or flat files.
b) Redundancy: Redundancy is another important issue. An attribute (such as annual revenue, for
instance) may be redundant if it can be "derived" from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis and covariance analysis.
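A minimal sketch of redundancy detection by correlation analysis, assuming two hypothetical numeric attributes; a correlation coefficient close to 1 (or -1) suggests one attribute can be derived from the other.

```python
import numpy as np

# Hypothetical attributes: annual_revenue is almost a fixed multiple of units_sold
units_sold = np.array([120, 150, 90, 200, 170], dtype=float)
annual_revenue = np.array([2400, 3010, 1790, 4005, 3395], dtype=float)

# Pearson correlation coefficient between the two attributes
r = np.corrcoef(units_sold, annual_revenue)[0, 1]
print(f"correlation = {r:.3f}")  # a value near 1 flags a likely redundant attribute
```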
c) Detection and Resolution of Data Value Conflicts: A third important issue in data
integration is the detection and resolution of data value conflicts. For example, for the
same real-world entity, attribute values from different sources may differ. This may be due
to differences in representation, scaling, or encoding. For instance, a weight attribute may
be stored in metric units in one system and British imperial units in another. For a hotel
chain, the price of rooms in different cities may involve not only different currencies but
also different services (such as free breakfast) and taxes. An attribute in one system may be
recorded at a lower level of abstraction than the "same" attribute in another. Careful
integration of the data from multiple sources can help to reduce and avoid redundancies
and inconsistencies in the resulting data set. This, in turn, can improve the accuracy and
speed of the subsequent mining process.
3. Data Transformation:
Data transformation changes the format, structure, or values of the data and converts
them into clean, usable data. Data may be transformed at two stages of the data pipeline for
data analytics projects. Organizations that use on-premises data warehouses generally use
an ETL (extract, transform, and load) process, in which data transformation is the middle
step. Today, most organizations use cloud-based data warehouses to scale compute and
storage resources with latency measured in seconds or minutes. The scalability of the cloud
platform lets organizations skip preload transformations and load raw data into the data
warehouse, then transform it at query time.
4. Data Reduction: Obtain a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results. Why data
reduction? A database or data warehouse may store terabytes of data, and complex data
analysis may take a very long time to run on the complete data set.
Data reduction strategies:
4.1 Data cube aggregation
4.2 Attribute subset selection
4.3 Numerosity reduction (e.g., fit data into models)
4.4 Dimensionality reduction - data compression
Data cube aggregation: For example, suppose the data consist of AllElectronics sales per
quarter for the years 2014 to 2017. You are, however, interested in the annual sales rather
than the total per quarter. Thus, the data can be aggregated so that the resulting data
summarize the total sales per year instead of per quarter.
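A minimal sketch of the quarterly-to-annual aggregation with pandas, using hypothetical sales figures.

```python
import pandas as pd

# Hypothetical quarterly sales (one row per quarter)
sales = pd.DataFrame({
    "year":    [2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224.0, 408.0, 350.0, 586.0, 230.0, 415.0, 360.0, 610.0],
})

# Roll up the time dimension: aggregate quarterly figures into annual totals
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```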
Attribute Subset Selection Attribute subset selection reduces the data set size by
removing irrelevant or redundant attributes (or dimensions). The goal of attribute
subset selection is to find a minimum set of attributes. It reduces the number of
attributes appearing in the discovered patterns, helping to make the patterns easier to
understand. For n attributes, there are 2^n possible subsets. An exhaustive search for
the optimal subset of attributes can be prohibitively expensive, especially as n and the
number of data classes increase. Therefore, heuristic methods that explore a reduced
search space are commonly used for attribute subset selection. These methods are
typically greedy in that, while searching through attribute space, they always make what
looks to be the best choice at the time.
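A minimal sketch of greedy (stepwise forward) attribute subset selection using scikit-learn; the data set and the choice of a decision tree evaluator are illustrative assumptions. At each step the attribute that most improves cross-validated accuracy is added.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # stand-in data set with 4 attributes
remaining = list(range(X.shape[1]))        # candidate attribute indices
selected = []                              # greedily chosen subset

for _ in range(2):                         # keep the 2 best attributes
    scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)     # best single addition at this step
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)
```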
The following additional techniques, discussed below, are also used for data reduction and preprocessing:
• Sampling
• Dimensionality reduction
• Feature creation
• Variable transformation
Data transformation operations, such as normalization and aggregation, are
additional data preprocessing procedures that contribute toward the success of the
mining process.
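A minimal sketch of min-max normalization, one common data transformation, assuming a hypothetical numeric attribute rescaled to the range [0, 1].

```python
import numpy as np

# Hypothetical attribute values (e.g., customer incomes)
values = np.array([48000.0, 56000.0, 73000.0, 39000.0, 61000.0])

# Min-max normalization: rescale to the interval [0, 1]
v_min, v_max = values.min(), values.max()
normalized = (values - v_min) / (v_max - v_min)
print(normalized)
```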
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
Aggregation:
It is the combining of two or more objects into a single object. Consider a data set
consisting of transactions (data objects) recording the daily sales of products in
various store locations (Minneapolis, Chicago, Paris, ...) for different days over the
course of a year.
One way to aggregate transactions for this data set is to replace all the transactions of
a single store with a single storewide transaction. This reduces the hundreds or thousands
of transactions that occur daily at a specific store to a single daily transaction, and the
number of data objects is reduced to the number of stores.
There are several motivations for aggregation. First, the smaller data sets resulting from
data reduction require less memory and processing time, and hence, aggregation may permit
the use of more expensive data mining algorithms.
A disadvantage of aggregation is the potential loss of interesting details. In the store example
aggregating over months loses information about which day of the week has the highest sales.
Sampling
Types of sampling:
Simple random sampling - There are two variations on random sampling (and other sampling
techniques as well):
- Sampling without replacement: as each item is selected, it is removed from the population.
- Sampling with replacement: objects are not removed from the population as they are
selected for the sample. In sampling with replacement, the same object can be picked more
than once.
Stratified sampling - Split the data into several partitions, then draw random samples from
each partition.
Progressive sampling: The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used. These approaches start with a small sample
and then increase the sample size until a sample of sufficient size has been obtained. While this
technique eliminates the need to determine the correct sample size initially, it requires that
there be a way to evaluate the sample to judge whether it is large enough.
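A minimal sketch of simple random sampling (with and without replacement) and stratified sampling with pandas, assuming a hypothetical table whose class column serves as the stratum.

```python
import pandas as pd

# Hypothetical data set with a class attribute used as the stratum
df = pd.DataFrame({
    "value": range(100),
    "cls":   ["A"] * 70 + ["B"] * 30,
})

without_repl = df.sample(n=10, replace=False, random_state=0)   # sampling without replacement
with_repl    = df.sample(n=10, replace=True,  random_state=0)   # sampling with replacement

# Stratified sampling: draw 10% from each class partition
stratified = df.groupby("cls", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))
print(stratified["cls"].value_counts())  # roughly preserves the A/B proportions
```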
Dimensionality Reduction: When dimensionality increases, data become increasingly sparse in the
space that they occupy, and definitions of density and distance between points, which are critical
for clustering and outlier detection, become less meaningful. This is known as the curse of
dimensionality.
Linear Algebra Techniques for Dimensionality Reduction Some of the most common approaches
for dimensionality reduction, particularly for continuous data, use techniques from linear
algebra to project the data from a high-dimensional space into a lower-dimensional space.
Principal Components Analysis (PCA) is a linear algebra technique for continuous attributes that
finds new attributes (principal components) that (1) are linear combinations of the original
attributes, (2) are orthogonal to each other, and (3) capture the maximum amount of variation in
the data. For example, the first two principal components capture as much of the variation in the
data as is possible with two orthogonal attributes that are linear combinations of the original
attributes. Singular Value Decomposition (SVD) is a linear algebra technique that is related to PCA
and is also commonly used for dimensionality reduction.
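A minimal sketch of PCA-based dimensionality reduction using scikit-learn, assuming a hypothetical continuous data set projected onto its first two principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # hypothetical 10-dimensional continuous data

pca = PCA(n_components=2)                 # keep the first two principal components
X_reduced = pca.fit_transform(X)          # project into the 2-dimensional space

print(X_reduced.shape)                    # (200, 2)
print(pca.explained_variance_ratio_)      # fraction of variation captured by each component
```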
Feature Subset Selection : Another way to reduce the dimensionality is to use only a subset
of the features.
Redundant features duplicate much or all of the information contained in one or more other
attributes. For example, the purchase price of a product and the amount of sales tax paid
contain much of the same information. Irrelevant features contain almost no useful information
for the data mining task at hand. For instance, students' ID numbers are irrelevant to the task of
predicting students' grade point averages. Redundant and irrelevant features can reduce
classification accuracy and the quality of the clusters that are found.
There are three standard approaches to feature selection: embedded, filter, and wrapper.
Embedded approaches: Feature selection occurs naturally as part of the data mining algorithm.
Specifically, during the operation of the data mining algorithm, the algorithm itself decides
which attributes to use and which to ignore. Algorithms for building decision tree classifiers
often operate in this manner.
Filter approaches: Features are selected before the data mining algorithm is run, using some
approach that is independent of the data mining task. For example, we might select sets of
attributes whose pairwise correlation is as low as possible.
Wrapper approaches: These methods use the target data mining algorithm as a black box to
find the best subset of attributes, in a way similar to that of the ideal algorithm approach, but
typically without enumerating all possible subsets.
Finally, once a subset of features has been selected, the results of the target data mining
algorithm on the selected subset should be validated. A straightforward evaluation approach is
to run the algorithm with the full set of features and compare the full results to results obtained
using the subset of features.
Hopefully, the subset of features will produce results that are better than or almost as good as
those produced when using all features.
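A minimal sketch of a filter approach, assuming hypothetical numeric attributes: features whose pairwise correlation with an already-kept feature exceeds a threshold are dropped before any mining algorithm is run.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=200)
df = pd.DataFrame({
    "f1": base,
    "f2": base * 2.0 + rng.normal(scale=0.01, size=200),  # nearly redundant with f1
    "f3": rng.normal(size=200),
})

corr = df.corr().abs()          # absolute pairwise correlations
keep = []
for col in df.columns:
    # keep the attribute only if it is not highly correlated with one already kept
    if all(corr.loc[col, k] < 0.9 for k in keep):
        keep.append(col)
print(keep)                     # ['f1', 'f3'] -- f2 is filtered out as redundant
```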
Feature Creation: It is frequently possible to create, from the original attributes, a new set of
attributes that captures the important information in a dataset much more effectively. Three
related methodologies for creating new attributes are described next: feature extraction, mapping
the data to a new space, and feature construction.
Feature Extraction: The creation of a new set of features from the original raw data is known as
feature extraction. Consider a set of photographs, where each photograph is to be classified
according to whether or not it contains a human face. The raw data is a set of pixels and, as such,
is not suitable for many types of classification algorithms.
Mapping the Data to a New Space A totally different view of the data can reveal important and
interesting features. Consider, for example, time series data, which often contains periodic
patterns. If there is only a single periodic pattern and not much noise, then the pattern is easily
detected. If, on the other hand, there are a number of periodic patterns and a significant amount of
noise is present, then these patterns are hard to detect. Such patterns can, nonetheless, often be
detected by applying a Fourier transform to the time series in order to change to a representation
in which frequency information is explicit
Example (Fourier Analysis): The time series presented in Figure (b) is the sum of three other
time series, two of which are shown in Figure (a) and have frequencies of 7 and 17 cycles per
second, respectively. The third time series is random noise. Figure (c) shows the power
spectrum that can be computed after applying a Fourier transform to the original time series.
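A minimal sketch that reproduces the idea of this example with NumPy: a noisy sum of 7 Hz and 17 Hz sinusoids is mapped to the frequency domain, where the two periodic patterns appear as peaks in the power spectrum. The sampling rate and noise level are illustrative assumptions.

```python
import numpy as np

fs = 200                                   # samples per second
t = np.arange(0, 2, 1 / fs)                # 2 seconds of data
rng = np.random.default_rng(0)

# Sum of a 7 Hz and a 17 Hz sinusoid plus random noise
x = (np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 17 * t)
     + rng.normal(scale=0.5, size=t.size))

power = np.abs(np.fft.rfft(x)) ** 2        # power spectrum
freqs = np.fft.rfftfreq(t.size, d=1 / fs)  # corresponding frequencies

top = freqs[np.argsort(power)[-2:]]        # the two strongest frequencies
print(sorted(top))                         # approximately [7.0, 17.0]
```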
Feature Construction: Sometimes the features in the original data sets have the necessary
information, but it is not in a form suitable for the data mining algorithm. In this situation, one
or more new features can be constructed out of the original features.
Example (Density). To illustrate this, consider a data set consisting of information about
historical artifacts, which, along with other information, contains the volume and mass of each
artifact. For simplicity, assume that these artifacts are made of a small number of materials
(wood, clay, bronze, gold) and that we want to classify the artifacts with respect to the material
of which they are made. In this case, a density feature constructed from the mass and volume
features, i.e., density = mass/volume, would most directly yield an accurate classification.
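A minimal sketch of this feature construction with pandas, using hypothetical mass and volume values for a few artifacts.

```python
import pandas as pd

# Hypothetical artifact measurements
artifacts = pd.DataFrame({
    "material":   ["wood", "clay", "bronze", "gold"],
    "mass_g":     [350.0, 900.0, 4400.0, 9650.0],
    "volume_cm3": [500.0, 500.0, 500.0, 500.0],
})

# Constructed feature: density separates the materials far better than mass or volume alone
artifacts["density_g_cm3"] = artifacts["mass_g"] / artifacts["volume_cm3"]
print(artifacts)
```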
Discretization:
Algorithms that find association patterns require that the data be in the form of binary
attributes. Thus, it is often necessary to transform a continuous attribute into a categorical
attribute (discretization)
Binarization: Both continuous and discrete attributes may need to be transformed into one or
more binary attributes (binarization).
A simple technique to binarize a categorical attribute is the following: If there are m categorical
values, then uniquely assign each original value to an integer in the interval [0, m - 1]. If the
attribute is ordinal, then order must be maintained by the assignment. Next, convert each of
these m integers to a binary number. Since n = ⌈log2(m)⌉ binary digits are required to represent
these integers, represent these binary numbers using n binary attributes.
To illustrate, a categorical variable with 5 values {awful, poor, OK, good, great} would require
three binary variables x1, x2, and x3. The conversion is shown in Table 2.5. Such a
transformation can cause complications, such as creating unintended relationships among the
transformed attributes. For example, in Table 2.5, attributes x2 and x3 are correlated because
information about the good value is encoded using both attributes. Furthermore, association
analysis requires asymmetric binary attributes, where only the presence of the attribute (value
= 1) is important. For association problems, it is therefore necessary to introduce one binary
attribute for each categorical value, as in Table 2.6.
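A minimal sketch of both binarization schemes, assuming the ordinal values above: a compact encoding with ⌈log2(m)⌉ binary attributes, and one asymmetric binary attribute per categorical value as needed for association analysis.

```python
import math
import pandas as pd

values = ["awful", "poor", "OK", "good", "great"]            # ordinal categorical values
ratings = pd.Series(["good", "awful", "great", "OK"])        # hypothetical data

# Compact encoding: map each value to an integer, then to ceil(log2(m)) binary attributes
codes = ratings.map({v: i for i, v in enumerate(values)}).to_numpy()
n_bits = math.ceil(math.log2(len(values)))                   # 3 bits for 5 values
compact = pd.DataFrame(
    {f"x{b + 1}": (codes >> (n_bits - 1 - b)) & 1 for b in range(n_bits)})

# Asymmetric encoding: one binary attribute per categorical value (as in Table 2.6)
one_per_value = pd.get_dummies(ratings).astype(int)

print(compact)
print(one_per_value)
```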
In the first step, after the values of the continuous attribute are sorted, they are then divided
into n intervals by specifying n - 1 split points.
In the second, rather trivial step, all the values in one interval are mapped to the same
categorical value. Therefore, the problem of discretization is one of deciding how many split
points to choose and where to place them.
Example: The results of applying equal-width, equal-frequency, and K-means discretization are
shown in Figures 2.13(b), 2.13(c), and 2.13(d), respectively. The split points are represented as
dashed lines. If we measure the performance of a discretization technique by the extent to
which different objects in different groups are assigned the same categorical value, then
K-means performs best, followed by equal frequency, and finally, equal width.
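A minimal sketch of equal-width and equal-frequency discretization with pandas, assuming a small hypothetical set of continuous values split into three intervals.

```python
import pandas as pd

values = pd.Series([2.0, 3.5, 4.1, 8.0, 9.5, 10.2, 18.0, 21.0, 25.0])

equal_width = pd.cut(values, bins=3, labels=["low", "mid", "high"])    # intervals of equal range
equal_freq  = pd.qcut(values, q=3, labels=["low", "mid", "high"])      # intervals of equal count

print(pd.DataFrame({"value": values,
                    "equal_width": equal_width,
                    "equal_freq": equal_freq}))
```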
Supervised Discretization: The discretization methods described above are usually better than
no discretization, but keeping the end purpose in mind and using additional information (class
labels) often produces better results. This should not be surprising, since an interval constructed
with no knowledge of class labels often contains a mixture of class labels. A conceptually simple
approach is to place the splits in a way that maximizes the purity of the intervals. In practice,
however, such an approach requires potentially arbitrary decisions about the purity of an
interval and the minimum size of an interval.
To overcome such concerns, some statistically based approaches start with each attribute value
as a separate interval and create larger intervals by merging adjacent intervals that are similar
according to a statistical test.
Categorical Attributes with Too Many Values: Categorical attributes can sometimes have too
many values. If the categorical attribute is an ordinal attribute, then techniques similar to those
for continuous attributes can be used to reduce the number of categories. If the categorical
attribute is nominal, however, then other approaches are needed. For example, a department
name attribute might have dozens of different values. In this situation, we could use our
knowledge of the relationships among different departments to combine departments into larger
groups, such as engineering, social sciences, or biological sciences.
Variable Transformation: A variable transformation refers to a transformation that is
applied to all the values of a variable. For example, if only the magnitude of a variable is
important, then the values of the variable can be transformed by taking the absolute value. Two
important types of variable transformations are simple functional transformations and
normalization.
Variable transformations should be applied with caution since they change the nature of the
data. While this is what is desired, there can be problems if the nature of the transformation is
not fully appreciated. For instance, the transformation 1/x reduces the magnitude of values that
are 1 or larger, but increases the magnitude of values between 0 and 1. To illustrate, the values
{1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}. Thus, for all sets of values, the
transformation 1/x reverses the order.
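A minimal sketch that demonstrates this order reversal numerically with NumPy (for positive values).

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0])   # increasing positive values
print(1.0 / x)                       # [2.0, 1.0, 0.5, 0.333...] -- the order is reversed
```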
Similarity and dissimilarity are important because they are used by a number of data mining
techniques, such as clustering, nearest neighbor classification, and anomaly detection. For
convenience, the term proximity is used to refer to either similarity or dissimilarity. The
proximity between two objects is a function of the proximity between the corresponding
attributes of the two objects. The similarity between two objects is a numerical measure of the
degree to which the two objects are alike.
Consequently, similarities are higher for pairs of objects that are more alike. Similarities are
usually non-negative and are often between 0 (no similarity) and 1 (complete similarity). The
dissimilarity between two objects is a numerical measure of the degree to which the two
objects are different. Dissimilarities are lower for more similar pairs of objects.
For objects with a single ordinal attribute, the situation is more complicated because
information about order should be taken into account. Consider an attribute that measures the
quality of a product, e.g., a candy bar, on the scale {poor, fair, OK, good, wonderful} . It would
seem reasonable that a product, P1, which is rated wonderful, would be closer to a product P2,
which is rated good, than it would be to a product P3, which is rated OK. To make this
observation quantitative, the values of the ordinal attribute are often mapped to successive
integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3, wonderful=4}. Then, d(P1, P2)
= 3 - 2 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (3 - 2)/4 = 0.25. A
similarity for ordinal attributes can then be defined as s = 1 - d.
Dissimilarities and Similarities between Data Objects:
If d(p, q) is the distance (dissimilarity) between two points (data objects) p and q, then a distance
measure typically satisfies the following properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q (positivity),
2. d(p, q) = d(q, p) for all p and q (symmetry),
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r (triangle inequality).
A distance that satisfies these properties is called a metric. The following is a list of several common
distance measures used to compare multivariate data. We will assume that the attributes are all continuous.
a) Euclidean Distance
Assume that we have measurements x_ik, i = 1, …, N, on variables k = 1, …, p (also called attributes).
The Euclidean distance between the ith and jth objects is
d_E(i, j) = ( Σ_{k=1}^{p} (x_ik - x_jk)^2 )^(1/2)
b) Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance:
d_M(i, j) = ( Σ_{k=1}^{p} |x_ik - x_jk|^λ )^(1/λ), where λ ≥ 1. It is also called the Lλ metric.
Note that λ and p are two different parameters: λ is the order of the distance, while p is the number
of attributes (the dimension of the data matrix).
c) Mahalanobis Distance
Let X be an N×p data matrix. Then the ith row of X is
x_i^T = (x_i1, x_i2, …, x_ip).
The Mahalanobis distance between the ith and jth objects is
d_MH(i, j) = ( (x_i - x_j)^T S^(-1) (x_i - x_j) )^(1/2), where S is the sample covariance matrix of X.
This section provides specific examples of some similarity and dissimilarity measures.
Similarity Measures for Binary Data: Let x and y be two objects that consist of n binary attributes.
The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient: One commonly used similarity coefficient is the simple matching
coefficient (SMC), which is defined as
SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
Jaccard Coefficient: Suppose that x and y are data objects that represent two rows (two
transactions) of a transaction matrix. The Jaccard coefficient, which is often symbolized by J, is
given by the following equation:
J = f11 / (f01 + f10 + f11)
Cosine Similarity: The cosine similarity, defined next, is one of the most common measures of
document similarity. If x and y are two document vectors, then
cos(x, y) = (x · y) / (||x|| ||y||), where · denotes the vector dot product and ||x|| is the length of vector x.
Extended Jaccard Coefficient:
The extended Jaccard coefficient can be used for document data and reduces to the
Jaccard coefficient in the case of binary attributes. The extended Jaccard coefficient is also
known as the Tanimoto coefficient and is defined as
EJ(x, y) = (x · y) / (||x||^2 + ||y||^2 - x · y)
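A minimal sketch computing the simple matching coefficient, the Jaccard coefficient, and the cosine similarity with NumPy, using two small hypothetical binary vectors and two hypothetical document term-count vectors.

```python
import numpy as np

# Hypothetical binary vectors (e.g., items bought by two customers)
x = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f00 + f01 + f10 + f11)
jaccard = f11 / (f01 + f10 + f11)
print(smc, jaccard)

# Hypothetical document vectors (term counts)
d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)
cosine = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cosine, 3))
```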
Correlation: The correlation between two data objects that have binary or continuous variables
is a measure of the linear relationship between the attributes of the objects. Pearson's
correlation coefficient between two data objects, x and y, is defined by the following equation:
corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = s_xy / (s_x s_y)
Several important issues arise in proximity calculation:
(1) how to handle the case in which attributes have different scales and/or are correlated;
(2) how to calculate proximity between objects that are composed of different types of
attributes, e.g., quantitative and qualitative;
(3) how to handle proximity calculation when attributes have different weights, i.e., when
not all attributes contribute equally to the proximity of objects.
A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are
correlated, have different ranges of values (different variances), and the distribution of the data
is approximately Gaussian (normal). Specifically, the Mahalanobis distance between two objects
(vectors) x and y is defined as
mahalanobis(x, y) = ( (x - y)^T S^(-1) (x - y) )^(1/2), where S is the covariance matrix of the data.
A general approach is needed when the attributes are of different types. One straightforward
approach is to compute the similarity between each attribute separately. Then combine these
similarities using a method that results in a similarity between 0 and 1. Typically, the overall
similarity is defined as the average of all the individual attribute similarities.
Using Weights:
In the preceding discussion, all attributes were treated equally. This is not desirable when some
attributes are more important to the definition of proximity than others. To address these
situations, the formulas for proximity can be modified by weighting the contribution of each attribute.
Common distance and similarity measures:
1. Euclidean Distance (L2 Norm):
Formula: d(x, y) = ( Σ_i (x_i - y_i)^2 )^(1/2)
Use Case: Suitable for continuous numerical data.
Considerations: Sensitive to outliers.
2. Manhattan Distance (L1 Norm):
Formula: d(x, y) = Σ_i |x_i - y_i|; in two dimensions, d = |x1 - x2| + |y1 - y2|.
Use Case: Suitable for sparse data and less sensitive to outliers than Euclidean distance.
3. Cosine Similarity:
Formula: cos(x, y) = (x · y) / (||x|| ||y||)
Use Case: Effective for text data, document similarity, and high-dimensional data.
Considerations: Ignores magnitude and focuses on direction.
4. Jaccard Similarity:
Formula: J(A, B) = |A ∩ B| / |A ∪ B|
Use Case: Suitable for binary or categorical data; often used in set comparisons.
5. Hamming Distance:
Formula: Number of positions at which the corresponding symbols differ.
Use Case: Applicable to binary or categorical data of the same length.
6. Minkowski Distance:
Formula: d(x, y) = ( Σ_i |x_i - y_i|^p )^(1/p)
Use Case: Generalization of Euclidean and Manhattan distances; the parameter p
determines the norm.
7. Correlation-based Measures:
Pearson Correlation Coefficient: Measures linear correlation.
Spearman Rank Correlation Coefficient: Measures monotonic relationships.
Use Case: Suitable for comparing the relationship between variables.
8. Mahalanobis Distance:
Formula: d(x, y) = ( (x - y)^T S^(-1) (x - y) )^(1/2), where S is the covariance matrix
Use Case: Effective when dealing with multivariate data with different scales.
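A minimal sketch computing several of these measures with NumPy and SciPy for two hypothetical numeric vectors (and two binary vectors for Jaccard and Hamming); the scipy.spatial.distance routines used are standard SciPy functions.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 5.0, 7.0, 6.0])

print("Euclidean:", distance.euclidean(x, y))
print("Manhattan:", distance.cityblock(x, y))
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Cosine similarity:", 1.0 - distance.cosine(x, y))  # SciPy returns the cosine *distance*

# Binary vectors for Jaccard similarity and Hamming distance
a = np.array([1, 1, 0, 1, 0, 0])
b = np.array([1, 0, 0, 1, 1, 0])
print("Jaccard similarity:", 1.0 - distance.jaccard(a, b))        # SciPy returns the dissimilarity
print("Hamming distance:", int(distance.hamming(a, b) * a.size))  # convert fraction to count

# Mahalanobis distance needs the inverse covariance matrix of the data set
data = np.random.default_rng(0).normal(size=(50, 4))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print("Mahalanobis:", distance.mahalanobis(x, y, VI))
```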