Unit-3 Data Reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving
the most important information. This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of irrelevant or redundant
information.
There are several different data reduction techniques that can be used in data mining, including:
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather than
using the entire dataset. This can reduce the size of a dataset while still preserving its overall
trends and patterns (see the sketch after this list).
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features
into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset that
are most relevant to the task at hand (also shown in the sketch after this list).
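To make techniques 1 and 5 concrete, here is a minimal Python sketch using pandas; the dataset, column names, sampling fraction, and variance cutoff are all made-up assumptions for the example.

import pandas as pd

# Hypothetical dataset: 100,000 rows with a numeric column, a near-constant
# column, and a categorical column (all names and sizes are made up).
df = pd.DataFrame({
    "amount": range(100_000),
    "flag": [0] * 99_000 + [1] * 1_000,
    "region": ["north", "south", "east", "west"] * 25_000,
})

# Data sampling: keep a 1% simple random sample (seed fixed for repeatability).
sample = df.sample(frac=0.01, random_state=42)

# Stratified sampling: take 1% from each region so rare groups are not lost.
stratified = df.groupby("region").sample(frac=0.01, random_state=42)

# A crude feature-selection rule: drop numeric columns whose variance is near
# zero, since they carry almost no information.
numeric = df[["amount", "flag"]]
keep = numeric.columns[numeric.var() > 0.01]
reduced = df[list(keep) + ["region"]]

print(len(sample), len(stratified), list(reduced.columns))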
Note that data reduction involves a trade-off between the size of the data and the accuracy of the
results: the more the data is reduced, the greater the risk that the resulting model becomes less
accurate and less generalizable.
In conclusion, data reduction is an important step in data mining, as it can help to improve the efficiency
and performance of machine learning algorithms by reducing the size of the dataset. However, it is
important to be aware of the trade-off between the size and accuracy of the data, and carefully assess
the risks and benefits before implementing it.
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms,
such as Huffman encoding and run-length encoding. It can be divided into two types based on the
compression technique used.
Lossless Compression –
Encoding techniques such as run-length encoding provide a simple but modest reduction in data
size. Lossless data compression uses algorithms that restore the exact original data from the
compressed data.
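A minimal Python sketch of run-length encoding; it demonstrates that the exact original string can be restored from the compressed form (the input string is made up):

from itertools import groupby

def rle_encode(data):
    # Collapse each run of repeated symbols into a (symbol, count) pair.
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    # Expand the pairs back; the result is exactly the original string.
    return "".join(symbol * count for symbol, count in pairs)

encoded = rle_encode("aaaabbbcca")
print(encoded)                               # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
assert rle_decode(encoded) == "aaaabbbcca"   # lossless: nothing was lost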
Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal component analysis) are
examples of lossy compression. For example, the JPEG image format uses lossy compression, yet
the decompressed image remains visually equivalent to the original. In lossy compression, the
decompressed data may differ from the original data but is still useful enough to retrieve
information from.
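A minimal Python sketch of PCA used as a lossy reduction, assuming scikit-learn is available; the synthetic data and the choice of 3 components are assumptions for the example:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))      # hidden low-dimensional structure
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

pca = PCA(n_components=3)               # keep only 3 principal components
X_reduced = pca.fit_transform(X)        # compressed representation (500 x 3)
X_restored = pca.inverse_transform(X_reduced)   # approximate reconstruction

# The reconstruction differs slightly from X: some information is lost,
# which is what makes this compression lossy.
print(np.mean((X - X_restored) ** 2))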
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a smaller representation. Parametric
methods fit a mathematical model to the data, so only the model parameters need to be stored
rather than the data itself; non-parametric methods include clustering, histograms, and sampling.
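As a sketch of the parametric idea, the following example (with made-up data) fits a straight line to 10,000 noisy points and stores only the two line parameters instead of the points themselves:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 10_000)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=x.size)   # 10,000 noisy points

# Fit a degree-1 polynomial; only these two numbers are stored.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)                 # close to the true 2.5 and 1.0

# Any value can later be approximated from the stored parameters alone.
y_hat = slope * 4.2 + intercept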
Bottom-up discretization –
If all of the continuous values are first considered as potential split points, and some are then
discarded by merging neighboring values into intervals, the process is called bottom-up
discretization (merging).
Concept Hierarchies:
These reduce the data size by collecting low-level concepts (such as the value 43 for age) and
replacing them with high-level concepts (categorical values such as middle age or senior).
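A minimal Python sketch of a concept hierarchy for age; the category names and boundaries are illustrative assumptions, not a standard:

def age_to_concept(age):
    # Replace a low-level value with a high-level concept.
    if age < 18:
        return "youth"
    elif age < 45:
        return "middle age"
    else:
        return "senior"

print([age_to_concept(a) for a in [12, 43, 67]])
# ['youth', 'middle age', 'senior']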
Binning –
Binning is the process of converting numerical variables into categorical counterparts. The
number of categories depends on the number of bins specified by the user.
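A minimal sketch of binning with pandas; the values, the number of bins, and the labels are made up for the example:

import pandas as pd

prices = pd.Series([3, 7, 15, 22, 38, 41, 55, 90])

# Four equal-width bins; the user chooses the number of bins and the labels.
binned = pd.cut(prices, bins=4, labels=["low", "mid_low", "mid_high", "high"])
print(binned.tolist())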
Histogram analysis –
Like binning, histogram analysis partitions the values of an attribute X into disjoint ranges called
buckets. There are several partitioning rules:
1. Equal-Frequency Partitioning: Partitioning the values so that each bucket holds roughly the
same number of data points.
2. Equal-Width Partitioning: Partitioning the values into buckets of a fixed width, e.g. the
ranges 0-20, 21-40, and so on.
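Both partitioning rules can be sketched with pandas (the values and the number of buckets are made up):

import pandas as pd

values = pd.Series([1, 2, 2, 3, 5, 8, 13, 21, 34, 55])

# Equal-width: three buckets of identical width across the value range.
equal_width = pd.cut(values, bins=3)

# Equal-frequency (equi-depth): each bucket holds about the same number of
# values, so the bucket widths vary.
equal_freq = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())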
Data reduction in data mining can have a number of advantages and disadvantages.
Advantages:
1. Improved efficiency: Data reduction can help to improve the efficiency of machine learning
algorithms by reducing the size of the dataset. This can make it faster and more practical to
work with large datasets.
2. Improved performance: Data reduction can help to improve the performance of machine
learning algorithms by removing irrelevant or redundant information from the dataset. This can
help to make the model more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage costs associated with
large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the interpretability of the results
by removing irrelevant or redundant information from the dataset.
Disadvantages:
1. Loss of information: Data reduction can result in a loss of information if important data is
removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model, as reducing the size of
the dataset can also remove important information that is needed for accurate predictions.
3. Impact on interpretability: Data reduction can make it harder to interpret the results, as
removing irrelevant or redundant information can also remove context that is needed to
understand the results.
4. Additional computational costs: Data reduction can add computational costs to the data
mining process, as it requires extra processing time to reduce the data.
In conclusion, data reduction can have both advantages and disadvantages. It can improve the
efficiency and performance of machine learning algorithms by reducing the size of the dataset, but it
can also result in a loss of information and make the results harder to interpret. It is important to
weigh the pros and cons of data reduction and carefully assess the risks and benefits before
implementing it.