Unit-3 Data Reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving
the most important information. This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of irrelevant or redundant
information.
There are several different data reduction techniques that can be used in data mining, including:
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather than
using the entire dataset. This can reduce the size of a dataset while still preserving its overall
trends and patterns (see the sketch after this list).
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features
into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset that
are most relevant to the task at hand (also shown in the sketch after this list).
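To make techniques 1 and 5 concrete, here is a minimal Python sketch using pandas; the dataset, column names, sampling fraction, and variance cutoff are all made-up assumptions for the example.

import pandas as pd

# Hypothetical dataset: 100,000 rows with a numeric column, a near-constant
# column, and a categorical column (all names and sizes are made up).
df = pd.DataFrame({
    "amount": range(100_000),
    "flag": [0] * 99_000 + [1] * 1_000,
    "region": ["north", "south", "east", "west"] * 25_000,
})

# Data sampling: keep a 1% simple random sample (seed fixed for repeatability).
sample = df.sample(frac=0.01, random_state=42)

# Stratified sampling: take 1% from each region so rare groups are not lost.
stratified = df.groupby("region").sample(frac=0.01, random_state=42)

# A crude feature-selection rule: drop numeric columns whose variance is near
# zero, since they carry almost no information.
numeric = df[["amount", "flag"]]
keep = numeric.columns[numeric.var() > 0.01]
reduced = df[list(keep) + ["region"]]

print(len(sample), len(stratified), list(reduced.columns))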
Note that data reduction involves a trade-off between the size of the data and the accuracy of the
results: the more the data is reduced, the greater the risk that the resulting model becomes less
accurate and less generalizable.
In conclusion, data reduction is an important step in data mining, as it can help to improve the efficiency
and performance of machine learning algorithms by reducing the size of the dataset. However, it is
important to be aware of the trade-off between the size and accuracy of the data, and carefully assess
the risks and benefits before implementing it.
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms,
such as Huffman encoding and run-length encoding. It can be divided into two types based on the
compression technique used.
Lossless Compression –
Encoding techniques such as run-length encoding provide a simple but modest reduction in data
size. Lossless data compression uses algorithms that restore the exact original data from the
compressed data.
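A minimal Python sketch of run-length encoding; it demonstrates that the exact original string can be restored from the compressed form (the input string is made up):

from itertools import groupby

def rle_encode(data):
    # Collapse each run of repeated symbols into a (symbol, count) pair.
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    # Expand the pairs back; the result is exactly the original string.
    return "".join(symbol * count for symbol, count in pairs)

encoded = rle_encode("aaaabbbcca")
print(encoded)                               # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
assert rle_decode(encoded) == "aaaabbbcca"   # lossless: nothing was lost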
Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal component analysis) are
examples of lossy compression. For example, the JPEG image format uses lossy compression, yet
the decompressed image remains visually equivalent to the original. In lossy compression, the
decompressed data may differ from the original data but is still useful enough to retrieve
information from.
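A minimal Python sketch of PCA used as a lossy reduction, assuming scikit-learn is available; the synthetic data and the choice of 3 components are assumptions for the example:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))      # hidden low-dimensional structure
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

pca = PCA(n_components=3)               # keep only 3 principal components
X_reduced = pca.fit_transform(X)        # compressed representation (500 x 3)
X_restored = pca.inverse_transform(X_reduced)   # approximate reconstruction

# The reconstruction differs slightly from X: some information is lost,
# which is what makes this compression lossy.
print(np.mean((X - X_restored) ** 2))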
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a smaller representation. Parametric
methods fit a mathematical model to the data, so only the model parameters need to be stored
rather than the data itself; non-parametric methods include clustering, histograms, and sampling.
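As a sketch of the parametric idea, the following example (with made-up data) fits a straight line to 10,000 noisy points and stores only the two line parameters instead of the points themselves:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 10_000)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=x.size)   # 10,000 noisy points

# Fit a degree-1 polynomial; only these two numbers are stored.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)                 # close to the true 2.5 and 1.0

# Any value can later be approximated from the stored parameters alone.
y_hat = slope * 4.2 + intercept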
Bottom-up discretization –
If all of the continuous values are first considered as potential split points, and some are then
discarded by merging neighboring values into intervals, the process is called bottom-up
discretization (merging).
Concept Hierarchies:
These reduce the data size by collecting low-level concepts (such as the value 43 for age) and
replacing them with high-level concepts (categorical values such as middle age or senior).
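A minimal Python sketch of a concept hierarchy for age; the category names and boundaries are illustrative assumptions, not a standard:

def age_to_concept(age):
    # Replace a low-level value with a high-level concept.
    if age < 18:
        return "youth"
    elif age < 45:
        return "middle age"
    else:
        return "senior"

print([age_to_concept(a) for a in [12, 43, 67]])
# ['youth', 'middle age', 'senior']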
Binning –
Binning is the process of converting numerical variables into categorical counterparts. The
number of categories depends on the number of bins specified by the user.
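A minimal sketch of binning with pandas; the values, the number of bins, and the labels are made up for the example:

import pandas as pd

prices = pd.Series([3, 7, 15, 22, 38, 41, 55, 90])

# Four equal-width bins; the user chooses the number of bins and the labels.
binned = pd.cut(prices, bins=4, labels=["low", "mid_low", "mid_high", "high"])
print(binned.tolist())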
Histogram analysis –
Like binning, histogram analysis partitions the values of an attribute X into disjoint ranges called
buckets. There are several partitioning rules:
1. Equal-Frequency Partitioning: Partitioning the values so that each bucket holds roughly the
same number of data points.
2. Equal-Width Partitioning: Partitioning the values into buckets of a fixed width, e.g. the
ranges 0-20, 21-40, and so on.
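Both partitioning rules can be sketched with pandas (the values and the number of buckets are made up):

import pandas as pd

values = pd.Series([1, 2, 2, 3, 5, 8, 13, 21, 34, 55])

# Equal-width: three buckets of identical width across the value range.
equal_width = pd.cut(values, bins=3)

# Equal-frequency (equi-depth): each bucket holds about the same number of
# values, so the bucket widths vary.
equal_freq = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())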
Data reduction in data mining can have a number of advantages and disadvantages.
Advantages:
1. Improved efficiency: Data reduction can help to improve the efficiency of machine learning
algorithms by reducing the size of the dataset. This can make it faster and more practical to
work with large datasets.
2. Improved performance: Data reduction can help to improve the performance of machine
learning algorithms by removing irrelevant or redundant information from the dataset. This can
help to make the model more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage costs associated with
large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the interpretability of the results
by removing irrelevant or redundant information from the dataset.
Disadvantages:
1. Loss of information: Data reduction can result in a loss of information if important data is
removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model, as reducing the size of
the dataset can also remove important information that is needed for accurate predictions.
3. Impact on interpretability: Data reduction can make it harder to interpret the results, as
removing irrelevant or redundant information can also remove context that is needed to
understand the results.
4. Additional computational costs: Data reduction can add computational costs to the data
mining process, as it requires extra processing time to reduce the data.
In conclusion, data reduction can have both advantages and disadvantages. It can improve the
efficiency and performance of machine learning algorithms by reducing the size of the dataset, but it
can also result in a loss of information and make the results harder to interpret. It is important to
weigh the pros and cons of data reduction and carefully assess the risks and benefits before
implementing it.