Data Mining UNIT II
Data preprocessing:- An overview, Data cleaning, Data Integration, Data Reduction, Data
transformation and Data discretization.
Data preprocessing is an important step in the data mining process. It refers to the cleaning, transforming,
and integrating of data in order to make it ready for analysis. The goal of data preprocessing is to improve
the quality of the data and to make it more suitable for the specific data mining task.
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as
missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.
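The notes do not prescribe a specific tool, but a minimal sketch of the removal and imputation techniques mentioned above could look like the following (using pandas; the column names and values are invented for illustration):
```python
# Sketch of basic cleaning: duplicate removal, imputation of missing values,
# and a simple rule-based filter for an implausible value. Data are illustrative.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 47, 47, 120],        # None = missing, 120 = suspicious outlier
    "income": [30000, 42000, None, None, 55000],
})

df = df.drop_duplicates()                                  # removal of duplicate rows
df["age"] = df["age"].fillna(df["age"].median())           # imputation with the median
df["income"] = df["income"].fillna(df["income"].mean())    # imputation with the mean
df = df[df["age"] <= 100]                                  # filter out an implausible age

print(df)
```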
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data
integration can be challenging as it requires handling data with different formats, structures, and
semantics. Techniques such as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format for analysis. Common
techniques used in data transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while standardization is used to transform the
data to have zero mean and unit variance. Discretization is used to convert continuous data into discrete
categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and feature
extraction. Feature selection involves selecting a subset of relevant features from the dataset, while
feature extraction involves transforming the data into a lower-dimensional space while preserving the
important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require categorical data.
Discretization can be achieved through techniques such as equal width binning, equal frequency binning,
and clustering.
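As a small illustrative sketch (the age values, bin counts, and labels below are invented, and pandas is assumed), equal-width and equal-frequency binning can be written as:
```python
# Equal-width bins split the value range into intervals of the same size;
# equal-frequency bins put (roughly) the same number of records in each interval.
import pandas as pd

ages = pd.Series([15, 22, 25, 31, 38, 45, 52, 61, 70])

equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
equal_freq  = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(equal_width.value_counts())
print(equal_freq.value_counts())
```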
Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1
and 1. Normalization is often used to handle data with different units and scales. Common normalization
techniques include min-max normalization, z-score normalization, and decimal scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis
results. The specific steps involved in data preprocessing may vary depending on the nature of the data
and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more
accurate.
Data preprocessing is a data mining technique which is used to transform the raw data into a useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It
involves handling of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples: Suitable only when the dataset is large and a tuple has several missing values.
2. Fill the missing values: The missing values can be filled in manually, with the attribute mean, or with the most probable value.
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by the mining algorithms; it may arise from faulty data collection or data entry errors. It can be handled in the following ways:
1. Binning:
Sorted data is divided into equal-sized segments, and each value is smoothed by consulting its neighbours, for example by replacing it with the segment mean or a boundary value.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups the similar data in a cluster. Outliers may go undetected, or they will fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (such as -1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual levels.
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the dataset
while preserving the important information. This is done to improve the efficiency of data analysis and to
avoid overfitting of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature
selection is often performed to remove irrelevant or redundant features from the dataset. It can be done
using various techniques such as correlation analysis, mutual information, and principal component
analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original features are high-
dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis
(LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to
reduce the size of the dataset while preserving the important information. It can be done using techniques
such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often used to
reduce the size of the dataset by replacing similar data points with a representative centroid. It can be
done using techniques such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important information.
Compression is often used to reduce the size of the dataset for storage and transmission purposes. It can
be done using techniques such as wavelet compression, JPEG compression, and gzip compression.
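A rough sketch of the feature selection, feature extraction (PCA), and sampling ideas described above, assuming pandas and scikit-learn are available (the synthetic data, the 0.95 correlation cutoff, 2 components, and the 50% sample are arbitrary choices):
```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("ABCDE"))
df["E"] = df["A"] * 2 + 0.01 * rng.normal(size=100)   # make E nearly redundant with A

# Feature selection: drop one attribute from any highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
selected = df.drop(columns=to_drop)

# Feature extraction: project the data onto 2 principal components
reduced = PCA(n_components=2).fit_transform(df)

# Sampling: keep a random 50% of the rows
sampled = df.sample(frac=0.5, random_state=0)

print(selected.shape, reduced.shape, sampled.shape)
```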
The data to be analyzed by data mining techniques are often incomplete, i.e. lacking attribute values or certain attributes of interest, or containing only aggregate data.
The data may also be inconsistent, i.e. containing discrepancies, for example in the department codes used to categorize items.
Incomplete, noisy, and inconsistent data are present in large real-world databases and data warehouses. The data may be incomplete for many reasons:
a) Attributes of interest may not always be available, such as customer information for sales transaction data.
b) Other data may not be included simply because it was not considered important at the time of entry.
c) Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions.
d) Data that were inconsistent with other recorded data may have been deleted.
e) The recording of the history or modifications to the data may have been overlooked.
f) Missing data for tuples with missing values for some attributes may need to be inferred.
g) There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption.
h) Incorrect data may come from inconsistencies in the naming conventions or data codes used, or from inconsistent formats for input fields such as “date". Duplicate tuples also require data cleaning.
Data Cleaning
Data cleaning is an essential step in the data mining process and is crucial to the construction of a reliable model. Although it is required, this step is frequently overlooked. Data quality is the major problem in quality information management, and data quality problems can occur at any place in an information system. Data cleansing offers a solution to these issues.
Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly formatted,
duplicated, or insufficient data from a dataset. Even if results and algorithms appear to be correct, they are
unreliable if the data is inaccurate. There are numerous ways for data to be duplicated or incorrectly
labeled when merging multiple data sources.
In general, data cleaning lowers errors and raises the quality of the data. Although it can be a time-consuming and laborious operation, fixing data mistakes and removing incorrect information must be done. Data mining, a method for finding useful information in data, is itself a crucial aid for cleaning: data quality mining is a methodology that uses data mining methods to find and fix data quality issues in sizable databases, automatically extracting intrinsic and hidden information from large data sets. Data cleansing can therefore be accomplished using a variety of data mining approaches.
To arrive at a precise final analysis, it is crucial to understand and improve the quality of your data. The data must be prepared so that key patterns can be identified; this preparatory exploration is sometimes called exploratory data mining. Before doing business analysis and gaining insights, data cleaning in data mining enables the user to identify erroneous or missing data.
Because data cleaning is so time-consuming, it often requires IT personnel to assist in the initial step of reviewing the data. If your final analysis is inaccurate or you get an erroneous result, poor data quality is a likely cause.
Steps for Cleaning Data
You can follow these fundamental stages to clean your data, even though the techniques employed may vary depending on the sorts of data your firm stores:
1. Remove duplicate or irrelevant observations
Remove duplicate or pointless observations as well as undesirable observations from your dataset. Most duplicate observations arise during data gathering: duplicates can be produced when you merge data sets from several sources, scrape data, or receive data from clients or other departments. De-duplication is one of the most important factors to take into account in this procedure. Observations are deemed irrelevant when they do not pertain to the particular issue you are attempting to analyze.
You might eliminate those useless observations, for instance, if you wish to analyze data on millennial
clients but your dataset also includes observations from earlier generations. This can improve the
analysis's efficiency, reduce deviance from your main objective, and produce a dataset that is easier to
maintain and use.
2. Fix structural errors
Structural faults arise when you measure or transfer data and find odd naming practices, typos, or wrong capitalization. These inconsistencies may result in mislabelled categories or classes. For instance, "N/A" and "Not Applicable" might both be present on a given sheet, but they ought to be analyzed under the same heading.
3. Filter unwanted outliers
There will frequently be isolated observations that, at first glance, do not seem to fit the data you are analyzing. Removing an outlier when you have a good reason to, such as incorrect data entry, will improve the performance of the data you are working with.
However, occasionally the emergence of an outlier will support a theory you are investigating. And just
because there is an outlier, that doesn't necessarily indicate it is inaccurate. To determine the reliability of
the number, this step is necessary. If an outlier turns out to be incorrect or unimportant for the analysis,
you might want to remove it.
4. Handle missing data
You cannot overlook missing data, because many algorithms will not tolerate missing values. There are a few options for handling missing data; none of them is ideal, but each can be considered, for example:
• You can remove observations with missing values, but doing so results in a loss of information, so proceed with caution.
• You can fill in missing values based on other observations, but there is again a chance of undermining the integrity of the data, since you may be working from assumptions rather than actual observations.
• You may need to change the way the data is used so that null values can be handled efficiently.
5. Validate and QA
As part of fundamental validation, you ought to be able to respond to the following queries once the data
cleansing procedure is complete:
• Are the data coherent?
• Does the data abide by the regulations that apply to its particular field?
• Does it support or refute your working theory? Does it offer any new information?
• To support your next theory, can you identify any trends in the data?
• If not, is there a problem with the data's quality?
Inaccurate or noisy data can lead to false conclusions that inform poor company strategy and decision-making. False conclusions can also result in an embarrassing moment in a reporting meeting when you discover your data does not withstand further scrutiny. Before you arrive there, it is crucial to establish a culture of quality data in your organization, and the tools you might employ to develop this plan should be documented.
The data should be passed through one of the various data-cleaning procedures available. The procedures
are explained below:
1. Ignore the tuples: This approach is not very practical, as it is effective only when a tuple has several attributes with missing values; otherwise too much useful data is discarded.
2. Fill in the missing value: This strategy is also not always practical or efficient, and it can be time-consuming. The missing value has to be supplied: the most common way is manual entry, but other options include using the attribute mean or the most probable value.
3. Binning method: This strategy is fairly easy to comprehend. The sorted data is split into several equal-sized segments (bins), and each value is smoothed using the values nearby, for example by replacing it with the bin mean or a bin boundary.
4. Regression: With the use of the regression function, the data is smoothed out. Regression may be
multivariate or linear. Multiple regressions have more independent variables than linear
regressions, which only have one.
5. Clustering: This technique groups similar data into "groups" or "clusters"; values that do not fall into any cluster are then identified as outliers.
The data cleaning method for data mining is demonstrated in the subsequent sections.
1. Monitoring the errors: Keep track of the areas where errors seem to occur most frequently. This makes it simpler to identify and fix inaccurate or corrupt information, which is particularly important when integrating a potential substitute with current management software.
2. Standardize the mining process: To help lower the likelihood of duplicity, standardize the place
of insertion.
3. Validate data accuracy: Analyse the data and invest in data cleaning software. Artificial intelligence-based tools can be utilized to check thoroughly for accuracy.
4. Scrub for duplicate data: To save time when analyzing data, find duplicates. Analyzing and investing in independent data-cleansing tools that can process imperfect data in bulk and automate the operation makes it possible to avoid processing the same data repeatedly.
5. Research on data: Our data needs to be vetted, standardized, and duplicate-checked before this
action. There are numerous third-party sources, and these vetted and approved sources can extract
data straight from our databases. They assist us in gathering the data and cleaning it up so that it is
reliable, accurate, and comprehensive for use in business decisions.
6. Communicate with the team: Keeping the group informed will help with client development and
strengthening as well as giving more focused information to potential clients.
Data Cleansing Tools can be very helpful if you are not confident of cleaning the data yourself or have no
time to clean up all your data sets. You might need to invest in those tools, but it is worth the expenditure.
There are many data cleaning tools in the market. Here are some top-ranked data cleaning tools, such as:
1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM Infosphere Quality Stage
9. TIBCO Clarity
10. Winpure
Noisy data:- Noise is a random error or variance in a measured variable. Given a numerical attribute such as price, how can we “smooth” out the data to remove the noise?
Example: sorted data for price: 4, 8, 15, 21, 21, 24, 25, 28, 34.
Partitioned into equal-frequency bins:
Bin1: 4, 8, 15.
Bin2: 21, 21, 24.
Bin3: 25, 28, 34.
Smoothing by bin boundaries (each value is replaced by the closer of the two bin boundaries):
Bin1: 4, 4, 15.
Bin2: 21, 21, 24.
Bin3: 25, 25, 34.
Binning:- This method smooths a sorted data value by consulting its “neighborhood”, i.e. the values around it. Because binning methods consult the neighborhood of values, they perform local smoothing.
In the above example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e. each bin contains 3 values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
The mean of the values 4, 8, 15 in Bin1 is (4 + 8 + 15)/3 = 27/3 = 9. Therefore each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
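The same worked example can be reproduced in a short sketch (plain Python, values taken from the bins above); the bin-boundary rule shown assumes ties go to the lower boundary:
```python
# Smoothing by bin means and by bin boundaries for the sorted price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]   # equal-frequency bins of size 3

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_boundaries = [[min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
                 for b in bins]

print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```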
Regression:- Data can be smoothed by fitting the data to a function such as with regression. Regression is
a statistical technique that relates a dependent variable to one or more independent variables. A
regression model is able to show whether changes observed in the dependent variable are associated with
changes in one or more of the independent variables.
In linear regression we find the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other. Formulating a regression analysis helps you predict the effects of the independent variable on the dependent one.
Example: we can say that age and height can be described using a linear regression model. Since a
person's height increases as age increases, they have a linear relationship.
Multiple linear regression:- It is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
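A minimal sketch of regression-based smoothing for the age/height example above (the numbers are invented; NumPy's polyfit is just one of several ways to fit the line):
```python
# Fit a least-squares line to (age, height) pairs and replace each observed
# height with the fitted value, i.e. smooth the attribute with the regression line.
import numpy as np

age    = np.array([2, 4, 6, 8, 10, 12])
height = np.array([85, 102, 114, 128, 139, 150])   # cm, with some noise

slope, intercept = np.polyfit(age, height, deg=1)
smoothed = slope * age + intercept

print(np.round(smoothed, 1))
```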
Clustering:- Values that fall outside of the set of clusters are called outliers; clustering is therefore used to detect outliers.
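A small sketch of clustering-based outlier detection, here using DBSCAN (a density-based method; the points and the eps value are invented), so that any point not belonging to a dense cluster is labelled as noise:
```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],     # first dense group
              [5, 5], [5.1, 4.8], [4.9, 5.2],     # second dense group
              [12, -3]])                          # lies far from both groups

labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
outliers = X[labels == -1]                         # -1 marks points outside every cluster

print(outliers)   # expected: [[12. -3.]]
```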
Discrepancy detection:- This is the first step in data cleaning as a process. Discrepancies can be caused by several factors, such as poorly designed data entry forms with many optional fields, human error in data entry, deliberate errors (e.g. respondents not wanting to give information about themselves), and data decay (e.g. outdated addresses).
Discrepancies are also due to inconsistent data representations, the inconsistent use of codes, errors in the instrumentation devices that record data, and system errors. Errors can also occur when the data are used for purposes they were not intended for. Inconsistencies can further arise from data integration (e.g. when a given attribute has different names in different databases).
To detect discrepancies, use any knowledge regarding the properties of the data; such knowledge, or “data about data”, is called metadata.
Look for the inconsistent use of codes and any inconsistent data representation (such as “2024/07/25” and
“25/07/2024” for date).
• Unique rule:- This rule says that each value of the given attribute must be different from all other values for that attribute (i.e. all tuples must have different values for that attribute; no repeated values are allowed).
• Consecutive rule:-This rule says that there can be no missing values between the lowest and the
highest values for the attribute, and that all values must also be unique.
• Null rule:- This rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition. The null rule should specify how to record the null condition. The data should be examined against the unique, consecutive, and null rules.
• Example:- 1) Store zero (0) for missing numerical attributes.
• 2) Store a blank for character attributes; any other convention that may be in use (such as “don’t know” or “?”) should be transformed to a blank.
• For example:- A salesman in a firm or company will have a certain commission value, while all other employees will have a commission of NULL. It should be remembered that NULL is different from zero (0).
• Some of the reasons for missing values are:-
• 1) The person originally asked to provide a value for the attribute refuses to fill it in, and/or finds that the information requested is not applicable (e.g. a driving license number left blank by individuals who are not drivers).
• 2) The data entry person does not know the correct value.
Commercial tools can assist in the data transformation step. Data migration tools allow simple transformations to be specified, such as replacing the string “gender” by “sex”.
• The two step process of discrepancy detection and data transformation (to correct discrepancies)
iterates. This process is error prone and time consuming.
• Some transformations may introduce more discrepancies, and some nested discrepancies may only be detected after others have been fixed.
• Any tuples that cannot be automatically handled by a given transformation are written to a file without any explanation regarding the reason behind their failure. The entire data cleaning process also suffers from a lack of interactivity.
• A publicly available data cleaning tool is Potter's Wheel.
• It integrates discrepancy detection and transformation.
• The tool performs discrepancy checking automatically in the background on the latest transformed
view of the data.
Users can gradually develop and refine transformations as discrepancies are found; this leads to more effective and efficient data cleaning.
Data Integration:-
Data integration in data mining refers to the process of combining data from multiple sources into a
single, unified view. This can involve cleaning and transforming the data, as well as resolving any
inconsistencies or conflicts that may exist between the different sources. The goal of data integration is to
make the data more useful and meaningful for the purposes of analysis and decision making. Techniques
used in data integration include data warehousing, ETL (extract, transform, load) processes, and data
federation.
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data
sources into a coherent data store and provides a unified view of the data. These sources may include
multiple data cubes, databases, or flat files.
Data integration is the process of combining data from multiple sources into a cohesive and consistent
view. This process involves identifying and accessing the different data sources, mapping the data to a
common format, and reconciling any inconsistencies or discrepancies between the sources. The goal of
data integration is to make it easier to access and analyze data that is spread across multiple systems or
platforms, in order to gain a more complete and accurate understanding of the data.
Data integration can be challenging due to the variety of data formats, structures, and semantics used
by different data sources. Different data sources may use different data types, naming conventions, and
schemas, making it difficult to combine the data into a single view. Data integration typically involves a
combination of manual and automated processes, including data profiling, data mapping, data
transformation, and data reconciliation.
There are several issues that can arise when integrating data from multiple sources, including:
1. Data Quality: Inconsistencies and errors in the data can make it difficult to combine and analyze.
2. Data Semantics: Different sources may use different terms or definitions for the same data,
making it difficult to combine and understand the data.
3. Data Heterogeneity: Different sources may use different data formats, structures, or schemas,
making it difficult to combine and analyze the data.
4. Data Privacy and Security: Protecting sensitive information and maintaining security can be
difficult when integrating data from multiple sources.
5. Scalability: Integrating large amounts of data from multiple sources can be computationally
expensive and time-consuming.
6. Data Governance: Managing and maintaining the integration of data from multiple sources can
be difficult, especially when it comes to ensuring data accuracy, consistency, and timeliness.
7. Performance: Integrating data from multiple sources can also affect the performance of the
system.
8. Integration with existing systems: Integrating new data sources with existing systems can be a
complex task, requiring significant effort and resources.
9. Complexity: The complexity of integrating data from multiple sources can be high, requiring
specialized skills and knowledge.
Three classical issues must also be addressed during data integration:
1) Schema Integration.
2) Redundancy Detection.
3) Resolution of data value conflicts.
These are explained in brief below.
1. Schema Integration: Metadata from the different sources must be matched up, and equivalent real-world entities from multiple sources have to be identified; this is known as the entity identification problem. For example, customer_id in one database and cust_number in another may refer to the same attribute.
2. Redundancy Detection:
• Redundancy:- An attribute (such as annual revenue) may be redundant if it can be “derived” from another attribute or set of attributes.
• Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis. Given two attributes, correlation analysis measures how strongly one attribute implies the other based on the available data.
• For numerical attributes, we can evaluate the correlation between two attributes A and B by computing the correlation coefficient; the closer its value is to +1 or -1, the stronger the correlation (a small sketch appears below).
• In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level.
• The use of denormalized tables also causes redundancy, so to avoid redundancy always use normalized tables.
3. Resolution of data value conflicts: For the same real-world entity, attribute values from different sources may differ because of differences in representation, scaling, or encoding (for example, a weight attribute stored in metric units in one system and in imperial units in another). Such conflicts must be detected and resolved during integration.
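As a small, hypothetical illustration of correlation analysis for redundancy detection, two numeric attributes (the names and values are invented) can be compared with the Pearson correlation coefficient; a value near +1 or -1 suggests one attribute can largely be derived from the other:
```python
import numpy as np

annual_revenue = np.array([120.0, 150.0, 180.0, 210.0, 260.0, 300.0])
monthly_avg    = annual_revenue / 12 + np.array([0.2, -0.1, 0.3, 0.0, -0.2, 0.1])

r = np.corrcoef(annual_revenue, monthly_avg)[0, 1]
print(round(r, 3))   # close to 1.0, so one of the two attributes is largely redundant
```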
Data transformation in data mining refers to the process of converting raw data into a format that is
suitable for analysis and modeling. The goal of data transformation is to prepare the data for data mining
so that it can be used to extract useful insights and knowledge. Data transformation typically involves
several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the data.
2. Data integration: Combining data from multiple sources, such as databases and spreadsheets,
into a single format.
3. Data normalization: Scaling the data to a common range of values, such as between 0 and 1, to
facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant features
or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by summing or
averaging, to create new features or attributes.
Data transformation is an important step in the data mining process: it helps to ensure that the data is in a format suitable for analysis and modeling and that it is free of errors and inconsistencies. Data transformation can also help to improve the performance of data mining algorithms by reducing the dimensionality of the data and by scaling the data to a common range of values.
The data are transformed in ways that are ideal for mining the data. The data transformation involves
steps that are:
1. Smoothing: It is a process that is used to remove noise from the dataset using some algorithm. It allows important features of the dataset to be highlighted and helps in predicting patterns. When collecting data, it can be manipulated to eliminate or reduce variance and other forms of noise. The idea behind data smoothing is that it identifies simple changes and so helps predict different trends and patterns. This is useful for analysts or traders who need to look at a lot of data that would otherwise be difficult to digest, letting them find patterns they would not see otherwise.
2. Aggregation: Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data sources into a data
analysis description. This is a crucial step since the accuracy of data analysis insights is highly dependent
on the quantity and quality of the data used. Gathering accurate data of high quality and a large enough
quantity is necessary to produce relevant results. The collection of data is useful for everything from
decisions concerning financing or business strategy of the product, pricing, operations, and marketing
strategies. For example, sales data may be aggregated to compute monthly and annual total amounts.
3. Discretization: It is a process of transforming continuous data into a set of small intervals. Most data mining activities in the real world involve continuous attributes, yet many existing data mining frameworks are unable to handle them. Even when a data mining task can manage a continuous attribute, its efficiency can be improved significantly by replacing the continuous attribute with its discretized values. For example, intervals (1-10, 11-20, ...) or (age:- young, middle age, senior).
4. Attribute Construction: Where new attributes are created & applied to assist the mining process from
the given set of attributes. This simplifies the original data & makes the mining more efficient.
5. Generalization: It converts low-level data attributes to high-level data attributes using concept
hierarchy. For Example Age initially in Numerical form (22, 25) is converted into categorical value
(young, old). For example, Categorical attributes, such as house addresses, may be generalized to higher-
level definitions, such as town or country.
6. Normalization: Data normalization involves converting all data variables into a given range.
Techniques that are used for normalization are:
• Min-Max Normalization:
o This transforms the original data linearly.
o Suppose that min_A is the minimum and max_A is the maximum value of an attribute A.
o Here v is the original value and v' is the new value obtained after normalization.
o Min-max normalization maps a value, v, of A to v' in the range [new_min_A, new_max_A] by computing:
o v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
• Z-Score Normalization:
o In z-score normalization (or zero-mean normalization) the values of an attribute A are normalized based on the mean of A and its standard deviation.
o A value, v, of attribute A is normalized to v' by computing:
o v' = (v - mean_A) / std_A
• Decimal Scaling:
o It normalizes the values of an attribute by moving the position of their decimal point.
o The number of decimal places moved depends on the maximum absolute value of attribute A.
o A value, v, of A is normalized to v' by computing:
o v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
o Suppose the values of an attribute P vary from -99 to 99; the maximum absolute value of P is 99.
o To normalize, we divide each value by 100 (i.e. j = 2, the number of digits in the largest absolute value), so the values become 0.98, 0.97, and so on.
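The three formulas above can be sketched directly in code; the sample values follow the -99 to 99 decimal-scaling example, and the target range [0, 1] for min-max is an arbitrary choice:
```python
import numpy as np

v = np.array([-99.0, -45.0, 0.0, 12.0, 98.0])

# Min-max normalization to [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the number of digits of the largest absolute value
j = len(str(int(np.abs(v).max())))
v_decimal = v / (10 ** j)

print(np.round(v_minmax, 3))
print(np.round(v_zscore, 3))
print(v_decimal)
```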
Data Reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the
most important information. This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining,
including:
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather than
using the entire dataset. This can be useful for reducing the size of a dataset while still preserving
the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features into a
single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset that are
most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between accuracy and data size: if too much information is discarded, the resulting model may become less accurate and less generalizable.
2. Dimension reduction:
Whenever we come across attributes that are only weakly relevant, we keep just the attributes required for our analysis. Dimension reduction shrinks the data size by eliminating outdated or redundant features.
Suppose the data set contains a number of attributes, a few of which are redundant. Attribute subset selection (for example, stepwise forward selection) retains the best attributes one at a time:
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
3. Data Compression:
The data compression technique reduces the size of the files using different encoding mechanisms
(Huffman Encoding & run-length Encoding). We can divide it into two types based on their compression
techniques.
• Lossless Compression –
Encoding techniques (Run Length Encoding) allow a simple and minimal data size reduction.
Lossless data compression uses algorithms to restore the precise original data from the
compressed data.
• Lossy Compression –
Methods such as the Discrete Wavelet transform technique, PCA (principal component analysis)
are examples of this compression. For e.g., the JPEG image format is a lossy compression, but we
can find the meaning equivalent to the original image. In lossy-data compression, the
decompressed data may differ from the original data but are useful enough to retrieve information
from them.
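A tiny sketch of the lossless/lossy distinction above, using Python's built-in gzip module for the lossless case and simple rounding as a stand-in for lossy schemes such as wavelet or PCA-based compression:
```python
import gzip
import json

data = [4.0213, 8.1987, 15.4401, 21.0052, 24.9978]

# Lossless: compress the exact values and recover them bit-for-bit
packed = gzip.compress(json.dumps(data).encode("utf-8"))
restored = json.loads(gzip.decompress(packed).decode("utf-8"))
assert restored == data

# Lossy: keep one decimal place; smaller to store, but the original values are not recoverable
approx = [round(x, 1) for x in data]
print(len(packed), approx)
```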
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller representation of the data, so that only the model parameters need to be stored rather than the actual data, or with non-parametric representations such as clustering, histograms, and sampling.
The data volume can be reduced by choosing alternative, smaller forms of data representation; for this purpose numerosity reduction can be applied. These techniques can be parametric or non-parametric.
Parametric methods:- A model is used to estimate the data, so that only the model parameters are stored instead of the actual data.
Non-parametric methods:- These methods store reduced representations of the data directly, for example histograms, clustering, and sampling.
** Log-linear models:- They approximate discrete multidimensional probability distributions. Log-linear models are used to estimate the probability of each point in a multidimensional space, for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional spaces. Log-linear models are also useful for dimensionality reduction and data smoothing.
** Histograms:- They are a popular method of data reduction. A histogram uses binning to approximate the data distribution. The histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
Histograms:- There are several partitioning rules to determine the buckets and partition the attribute values.
i) Equal width:- In an equal-width histogram, the width of each bucket range is uniform.
ii) Equal frequency (or equidepth):- The buckets are created so that the frequency of each bucket is constant (i.e. each bucket contains roughly the same number of contiguous data samples).
iii) V-optimal:- This is the histogram with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where the bucket weight is equal to the number of values in the bucket.
iv) MaxDiff:- In this histogram we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair for the B-1 pairs having the largest differences, where B is the number of buckets specified by the user.
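The equal-width and equal-frequency rules can be sketched quickly with NumPy (the price values and the choice of three buckets are invented):
```python
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 15, 15, 18, 20, 21, 25, 30])

# Equal width: bucket boundaries are evenly spaced over the value range
counts, edges = np.histogram(prices, bins=3)
print(edges, counts)

# Equal frequency (equidepth): boundaries chosen so each bucket holds about the same number of values
quantile_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
print(quantile_edges)
```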
5. Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide attributes of a continuous nature into data with intervals. We replace the many continuous values of the attribute with labels of small intervals, so that the mining results can be presented in a concise and easily understandable way.
• Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the
whole set of attributes and repeat this method up to the end, then the process is known as top-
down discretization also known as splitting.
• Bottom-up discretization –
If you first consider all the continuous values as split points and then discard some of them by merging neighborhood values into intervals, that process is called bottom-up discretization, also known as merging.
Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts (such as 43 for age) with
high-level concepts (categorical variables such as middle age or Senior).
• Binning –
Binning is the process of changing numerical variables into categorical counterparts. The number
of categorical counterparts depends on the number of bins specified by the user.
• Histogram analysis –
Like the process of binning, the histogram is used to partition the value for the attribute X, into
disjoint ranges called brackets. There are several partitioning rules:
1. Equal Frequency partitioning: Partitioning the values based on their number of
occurrences in the data set.
2. Equal Width Partitioning: Partitioning the values in a fixed gap based on the number of
bins i.e. a set of values ranging from 0-20.
3. Clustering: Grouping similar data together.
Data discretization refers to a method of converting a huge number of data values into smaller ones so
that the evaluation and management of data become easy. In other words, data discretization is a method
of converting attributes values of continuous data into a finite set of intervals with minimum data loss.
There are two forms of data discretization: supervised discretization and unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization does not use class information and is characterized instead by the direction in which it proceeds, i.e. a top-down splitting strategy or a bottom-up merging strategy.
Histogram analysis
Histogram refers to a plot used to represent the underlying frequency distribution of a continuous data set.
Histogram assists the data inspection for data distribution. For example, Outliers, skewness
representation, normal distribution representation, etc.
Binning
Binning refers to a data smoothing technique that helps to group a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and the development of concept hierarchies.
Cluster Analysis
Cluster analysis is a form of data discretization: a clustering algorithm is applied to divide the values of a numeric attribute X into clusters or groups, and each cluster then becomes one interval of the discretization.
Discretization by decision tree analysis uses a top-down splitting technique and is done through a supervised procedure. To discretize a numeric attribute, first select the split point that gives the least entropy, and then apply the procedure recursively. The recursive process divides the attribute into discretized disjoint intervals, from top to bottom, using the same splitting criterion.
Discretizing data by linear regression technique, you can get the best neighboring interval, and then the
large intervals are combined to develop a larger overlap to form the final 20 overlapping intervals. It is a
supervised procedure.
The term hierarchy represents an organizational structure or mapping in which items are ranked according to their levels of importance. In other words, a concept hierarchy refers to a sequence of mappings from a set of low-level concepts to more general, high-level concepts. For example, in computer science there are different types of hierarchical systems: a document placed in a folder in Windows, at a specific place in the tree structure, is a good example of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.
Let's understand this concept hierarchy for the dimension location with the help of an example.
A particular city can be mapped to the country to which it belongs. For example, New Delhi can be mapped to India, and India can be mapped to Asia.
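A concept hierarchy like this can be represented as a simple mapping; the sketch below follows the New Delhi -> India -> Asia example (the other city entries are made up):
```python
# Generalize a low-level value (city) up the hierarchy to a high-level value (continent).
city_to_country = {"New Delhi": "India", "Tokyo": "Japan", "Paris": "France"}
country_to_continent = {"India": "Asia", "Japan": "Asia", "France": "Europe"}

def generalize(city: str) -> str:
    return country_to_continent[city_to_country[city]]

print(generalize("New Delhi"))   # Asia
```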
Top-down mapping
Top-down mapping generally starts with the top with some general information and ends with the bottom
to the specialized information.
Bottom-up mapping
Bottom-up mapping generally starts with the bottom with some specialized information and ends with the
top to the generalized information.