SlideShare a Scribd company logo
1
Data Reduction
2
Data Reduction Strategies
 Need for data reduction
 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very long time to run on the
complete data set
 Data reduction
 Obtain a reduced representation of the data set that is much smaller in
volume but yet produce the same (or almost the same) analytical results
 Data reduction strategies
 Data cube aggregation
 Attribute Subset Selection
 Numerosity reduction — e.g., fit data into models
 Dimensionality reduction - Data Compression
 Discretization and concept hierarchy generation
3
2. Attribute Subset Selection
 Feature selection (i.e., attribute subset selection):
 Select a minimum set of features such that the probability
distribution of different classes given the values for those features is
as close as possible to the original distribution given the values of all
features
 reduce # of patterns - easier to understand
 Heuristic methods (due to exponential # of choices):
 Step-wise forward selection
 Step-wise backward elimination
 Combining forward selection and backward elimination
 Decision-tree induction
4
Example of Decision Tree Induction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
A4 ?
A1? A6?
Class 1 Class 2 Class 1 Class 2
> Reduced attribute set: {A1, A4, A6}
5
Heuristic Feature Selection Methods
 There are 2d
possible sub-features of d features
 Several heuristic feature selection methods:
 Best single features under the feature independence assumption:
choose by significance tests
 Best step-wise feature selection
 The best single-feature is picked first
 Then next best feature condition to the first, ...
 Step-wise feature elimination
 Repeatedly eliminate the worst feature
 Best combined feature selection and elimination
6
3. Numerosity Reduction
 Reduce data volume by choosing alternative, smaller
forms of data representation
 Parametric methods
 Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except possible
outliers)
 Example: Log-linear models, Regression
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
7
 Regression
 Handles skewed data
 Computationally Intensive
 Log linear models
 Scalability
 Estimate the probability of each point in a multi-dimensional
space based on a smaller subset
 Can construct higher dimensional spaces from lower dimensional
ones
 Both
 Sparse data
Regression and Log-Linear Models
8
Histograms
 Divide data into buckets and store average (sum) for each bucket
 Partitioning rules:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth)
 V-optimal: with the least histogram variance (weighted sum of the
original values that each bucket represents)
 MaxDiff: Consider difference between pair of adjacent values. Set
bucket boundary between each pair for pairs having the β (No. of
buckets)–1 largest differences
 Multi-dimensional histogram
Histograms
9
List of prices: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18,
18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30,
30
Histogram with Singleton buckets Equal-width Histogram
10
Clustering
 Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
 Can be very effective if data is clustered but not if data is
“smeared”
 Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
 Index tree / B+ tree – Hierarchical Histogram
986 3396 5411 8392 9544
11
Sampling
 Sampling: obtaining a small sample s to represent the whole data
set N
 Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
 Choose a representative subset of the data
 Simple random sampling may have very poor performance in the
presence of skew
 Develop adaptive sampling methods
 Stratified sampling
12
Sampling Techniques
 Simple Random Sample Without Replacement (SRSWOR)
 Simple Random Sample With Replacement (SRSWR)
 Cluster Sample
 Stratified Sample
13
Sampling: with or without Replacement
SRSWOR
(simple random
sample without
replacement)
SRSWR
Raw Data
14
Cluster Sample
 Tuples are grouped into M mutually disjoint clusters
 SRS of m clusters is taken where m < M
 Tuples in a database retrieved in pages
 Page - Cluster
 SRSWOR to pages
15
Stratified Sample
 Data is divided into mutually disjoint parts called
strata
 SRS at each stratum
 Representative samples ensured even in the
presence of skewed data
16
Cluster and Stratified Sampling
Raw Data
17
Features
 Cost depends on size of sample
 Sub-linear on size of data
 Linear with respect to dimensions
 Estimates answer to an aggregate query
18
Data Compression
 String compression
 There are extensive theories and well-tuned algorithms
 Typically lossless
 But only limited manipulation is possible
 Audio/video compression
 Typically lossy compression, with progressive refinement
 Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
19
Data Compression
Original Data
Compressed
Data
lossless
Original Data
Approximated
lossy
20
Dimensionality Reduction: Wavelet Transformation
 Discrete wavelet transform (DWT): linear signal processing
 Compressed approximation: store only a small fraction of the strongest of
the wavelet coefficients
 Similar to discrete Fourier transform (DFT), but better lossy compression,
localized in space
 Method:
 Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two set of data of length L/2
 Applies two functions recursively
Haar2 Daubechie4
21
Discrete Wavelet Transform
 Can also apply Matrix Multiplication
 Fast DWT algorithm
 Can be applied to Multi-Dimensional data such as Data Cubes
 Wavelet transforms work well on sparse or skewed data
 Used in Computer Vision, Compression of finger print images
22
 Given N data vectors from k-dimensions, find c ≤ k orthogonal vectors
(Principal components) that can be best used to represent data
 Steps
 Normalize input data: Each attribute falls within the same range
 Compute c orthonormal (unit) vectors, i.e., principal components
 Each input data (vector) is a linear combination of the c principal component
vectors
 The principal components are sorted in order of decreasing “significance” or
strength
 Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance. (i.e., using the
strongest principal components, it is possible to reconstruct a good
approximation of the original data)
 Works for numeric data only
 Used for handling sparse data
Dimensionality Reduction: Principal
Component Analysis (PCA)
23
X1
X2
Y1
Y2
Principal Component Analysis
Ad

More Related Content

What's hot (20)

3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
Azad public school
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
Sulman Ahmed
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
Krish_ver2
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ankur bhalla
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
DataminingTools Inc
 
Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture
janani thirupathi
 
k medoid clustering.pptx
k medoid clustering.pptxk medoid clustering.pptx
k medoid clustering.pptx
Roshan86572
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
Krish_ver2
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Salah Amean
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
Khwaja Aamer
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
GauravBiswas9
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
varshakumar21
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
Valerii Klymchuk
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
ANUSUYA T K
 
multi dimensional data model
multi dimensional data modelmulti dimensional data model
multi dimensional data model
moni sindhu
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
Slideshare
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
Saikiran Panjala
 
Data Integration and Transformation in Data mining
Data Integration and Transformation in Data miningData Integration and Transformation in Data mining
Data Integration and Transformation in Data mining
kavitha muneeshwaran
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
Dr. Hamdan Al-Sabri
 
Multidimensional data models
Multidimensional data  modelsMultidimensional data  models
Multidimensional data models
774474
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
Sulman Ahmed
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
Krish_ver2
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ankur bhalla
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
DataminingTools Inc
 
Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture
janani thirupathi
 
k medoid clustering.pptx
k medoid clustering.pptxk medoid clustering.pptx
k medoid clustering.pptx
Roshan86572
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
Krish_ver2
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Salah Amean
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
GauravBiswas9
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
Valerii Klymchuk
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
ANUSUYA T K
 
multi dimensional data model
multi dimensional data modelmulti dimensional data model
multi dimensional data model
moni sindhu
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
Slideshare
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
Saikiran Panjala
 
Data Integration and Transformation in Data mining
Data Integration and Transformation in Data miningData Integration and Transformation in Data mining
Data Integration and Transformation in Data mining
kavitha muneeshwaran
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
Dr. Hamdan Al-Sabri
 
Multidimensional data models
Multidimensional data  modelsMultidimensional data  models
Multidimensional data models
774474
 

Viewers also liked (18)

Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
Benazir Income Support Program (BISP)
 
Dimensionality reduction
Dimensionality reductionDimensionality reduction
Dimensionality reduction
Shatakirti Er
 
Datacube
DatacubeDatacube
Datacube
man2sandsce17
 
Concept description characterization and comparison
Concept description characterization and comparisonConcept description characterization and comparison
Concept description characterization and comparison
ric_biet
 
Different type of databases
Different type of databasesDifferent type of databases
Different type of databases
Shwe Yee
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
mrizwan969
 
Clique
Clique Clique
Clique
sk_klms
 
Difference between molap, rolap and holap in ssas
Difference between molap, rolap and holap  in ssasDifference between molap, rolap and holap  in ssas
Difference between molap, rolap and holap in ssas
Umar Ali
 
Database aggregation using metadata
Database aggregation using metadataDatabase aggregation using metadata
Database aggregation using metadata
Dr Sandeep Kumar Poonia
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Jason Rodrigues
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
Krish_ver2
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering Algorithm
Lino Possamai
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
SSA KPI
 
Data Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalizationData Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalization
DataminingTools Inc
 
Application of data mining
Application of data miningApplication of data mining
Application of data mining
SHIVANI SONI
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
Krish_ver2
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
International School of Engineering
 
OLAP
OLAPOLAP
OLAP
Slideshare
 
Dimensionality reduction
Dimensionality reductionDimensionality reduction
Dimensionality reduction
Shatakirti Er
 
Concept description characterization and comparison
Concept description characterization and comparisonConcept description characterization and comparison
Concept description characterization and comparison
ric_biet
 
Different type of databases
Different type of databasesDifferent type of databases
Different type of databases
Shwe Yee
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
mrizwan969
 
Difference between molap, rolap and holap in ssas
Difference between molap, rolap and holap  in ssasDifference between molap, rolap and holap  in ssas
Difference between molap, rolap and holap in ssas
Umar Ali
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
Krish_ver2
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering Algorithm
Lino Possamai
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
SSA KPI
 
Data Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalizationData Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalization
DataminingTools Inc
 
Application of data mining
Application of data miningApplication of data mining
Application of data mining
SHIVANI SONI
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
Krish_ver2
 
Ad

Similar to 1.7 data reduction (20)

Data Compression in Data mining and Business Intelligencs
Data Compression in Data mining and Business Intelligencs Data Compression in Data mining and Business Intelligencs
Data Compression in Data mining and Business Intelligencs
ShahDhruv21
 
Data reduction
Data reductionData reduction
Data reduction
GowriLatha1
 
Data preparation
Data preparationData preparation
Data preparation
Harry Potter
 
Data preparation
Data preparationData preparation
Data preparation
James Wong
 
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
ImXaib
 
Data preperation
Data preperationData preperation
Data preperation
Hoang Nguyen
 
Data preperation
Data preperationData preperation
Data preperation
Fraboni Ec
 
Data preperation
Data preperationData preperation
Data preperation
Luis Goldster
 
Data preparation
Data preparationData preparation
Data preparation
Tony Nguyen
 
Data preparation
Data preparationData preparation
Data preparation
Young Alista
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
DHIVYADEVAKI
 
Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
Datapreprocessingppt
Shree Hari
 
Data Mining
Data MiningData Mining
Data Mining
Jay Nagar
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data mining
Ujjawal
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.ppt
chatbot9
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
Anil Yadav
 
ClustIII.ppt
ClustIII.pptClustIII.ppt
ClustIII.ppt
SueMiu
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
GandhiMathy6
 
Datapreprocess
DatapreprocessDatapreprocess
Datapreprocess
sharmila parveen
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
Datamining Tools
 
Data Compression in Data mining and Business Intelligencs
Data Compression in Data mining and Business Intelligencs Data Compression in Data mining and Business Intelligencs
Data Compression in Data mining and Business Intelligencs
ShahDhruv21
 
Data preparation
Data preparationData preparation
Data preparation
James Wong
 
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
ImXaib
 
Data preperation
Data preperationData preperation
Data preperation
Fraboni Ec
 
Data preparation
Data preparationData preparation
Data preparation
Tony Nguyen
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
DHIVYADEVAKI
 
Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
Datapreprocessingppt
Shree Hari
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data mining
Ujjawal
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.ppt
chatbot9
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
Anil Yadav
 
ClustIII.ppt
ClustIII.pptClustIII.ppt
ClustIII.ppt
SueMiu
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
GandhiMathy6
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
Datamining Tools
 
Ad

More from Krish_ver2 (20)

5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back tracking
Krish_ver2
 
5.5 back track
5.5 back track5.5 back track
5.5 back track
Krish_ver2
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02
Krish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
Krish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
Krish_ver2
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithm
Krish_ver2
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03
Krish_ver2
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programming
Krish_ver2
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
Krish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
Krish_ver2
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
Krish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
Krish_ver2
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02
Krish_ver2
 
5.1 greedy
5.1 greedy5.1 greedy
5.1 greedy
Krish_ver2
 
5.1 greedy 03
5.1 greedy 035.1 greedy 03
5.1 greedy 03
Krish_ver2
 
4.4 hashing02
4.4 hashing024.4 hashing02
4.4 hashing02
Krish_ver2
 
4.4 hashing
4.4 hashing4.4 hashing
4.4 hashing
Krish_ver2
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing ext
Krish_ver2
 
4.4 external hashing
4.4 external hashing4.4 external hashing
4.4 external hashing
Krish_ver2
 
4.2 bst
4.2 bst4.2 bst
4.2 bst
Krish_ver2
 
5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back tracking
Krish_ver2
 
5.5 back track
5.5 back track5.5 back track
5.5 back track
Krish_ver2
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02
Krish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
Krish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
Krish_ver2
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithm
Krish_ver2
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03
Krish_ver2
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programming
Krish_ver2
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
Krish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
Krish_ver2
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
Krish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
Krish_ver2
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02
Krish_ver2
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing ext
Krish_ver2
 
4.4 external hashing
4.4 external hashing4.4 external hashing
4.4 external hashing
Krish_ver2
 

Recently uploaded (20)

YSPH VMOC Special Report - Measles Outbreak Southwest US 5-3-2025.pptx
YSPH VMOC Special Report - Measles Outbreak  Southwest US 5-3-2025.pptxYSPH VMOC Special Report - Measles Outbreak  Southwest US 5-3-2025.pptx
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-3-2025.pptx
Yale School of Public Health - The Virtual Medical Operations Center (VMOC)
 
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptxSCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
Ronisha Das
 
Exploring-Substances-Acidic-Basic-and-Neutral.pdf
Exploring-Substances-Acidic-Basic-and-Neutral.pdfExploring-Substances-Acidic-Basic-and-Neutral.pdf
Exploring-Substances-Acidic-Basic-and-Neutral.pdf
Sandeep Swamy
 
Operations Management (Dr. Abdulfatah Salem).pdf
Operations Management (Dr. Abdulfatah Salem).pdfOperations Management (Dr. Abdulfatah Salem).pdf
Operations Management (Dr. Abdulfatah Salem).pdf
Arab Academy for Science, Technology and Maritime Transport
 
Quality Contril Analysis of Containers.pdf
Quality Contril Analysis of Containers.pdfQuality Contril Analysis of Containers.pdf
Quality Contril Analysis of Containers.pdf
Dr. Bindiya Chauhan
 
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
Celine George
 
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Celine George
 
Geography Sem II Unit 1C Correlation of Geography with other school subjects
Geography Sem II Unit 1C Correlation of Geography with other school subjectsGeography Sem II Unit 1C Correlation of Geography with other school subjects
Geography Sem II Unit 1C Correlation of Geography with other school subjects
ProfDrShaikhImran
 
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Library Association of Ireland
 
Social Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy StudentsSocial Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy Students
DrNidhiAgarwal
 
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingHow to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
Celine George
 
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - WorksheetCBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
Sritoma Majumder
 
Metamorphosis: Life's Transformative Journey
Metamorphosis: Life's Transformative JourneyMetamorphosis: Life's Transformative Journey
Metamorphosis: Life's Transformative Journey
Arshad Shaikh
 
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Library Association of Ireland
 
P-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 finalP-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 final
bs22n2s
 
Anti-Depressants pharmacology 1slide.pptx
Anti-Depressants pharmacology 1slide.pptxAnti-Depressants pharmacology 1slide.pptx
Anti-Depressants pharmacology 1slide.pptx
Mayuri Chavan
 
Introduction to Vibe Coding and Vibe Engineering
Introduction to Vibe Coding and Vibe EngineeringIntroduction to Vibe Coding and Vibe Engineering
Introduction to Vibe Coding and Vibe Engineering
Damian T. Gordon
 
Marie Boran Special Collections Librarian Hardiman Library, University of Gal...
Marie Boran Special Collections Librarian Hardiman Library, University of Gal...Marie Boran Special Collections Librarian Hardiman Library, University of Gal...
Marie Boran Special Collections Librarian Hardiman Library, University of Gal...
Library Association of Ireland
 
Odoo Inventory Rules and Routes v17 - Odoo Slides
Odoo Inventory Rules and Routes v17 - Odoo SlidesOdoo Inventory Rules and Routes v17 - Odoo Slides
Odoo Inventory Rules and Routes v17 - Odoo Slides
Celine George
 
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACYUNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
DR.PRISCILLA MARY J
 
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptxSCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
Ronisha Das
 
Exploring-Substances-Acidic-Basic-and-Neutral.pdf
Exploring-Substances-Acidic-Basic-and-Neutral.pdfExploring-Substances-Acidic-Basic-and-Neutral.pdf
Exploring-Substances-Acidic-Basic-and-Neutral.pdf
Sandeep Swamy
 
Quality Contril Analysis of Containers.pdf
Quality Contril Analysis of Containers.pdfQuality Contril Analysis of Containers.pdf
Quality Contril Analysis of Containers.pdf
Dr. Bindiya Chauhan
 
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
Celine George
 
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Celine George
 
Geography Sem II Unit 1C Correlation of Geography with other school subjects
Geography Sem II Unit 1C Correlation of Geography with other school subjectsGeography Sem II Unit 1C Correlation of Geography with other school subjects
Geography Sem II Unit 1C Correlation of Geography with other school subjects
ProfDrShaikhImran
 
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Library Association of Ireland
 
Social Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy StudentsSocial Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy Students
DrNidhiAgarwal
 
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingHow to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
Celine George
 
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - WorksheetCBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
Sritoma Majumder
 
Metamorphosis: Life's Transformative Journey
Metamorphosis: Life's Transformative JourneyMetamorphosis: Life's Transformative Journey
Metamorphosis: Life's Transformative Journey
Arshad Shaikh
 
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Library Association of Ireland
 
P-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 finalP-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 final
bs22n2s
 
Anti-Depressants pharmacology 1slide.pptx
Anti-Depressants pharmacology 1slide.pptxAnti-Depressants pharmacology 1slide.pptx
Anti-Depressants pharmacology 1slide.pptx
Mayuri Chavan
 
Introduction to Vibe Coding and Vibe Engineering
Introduction to Vibe Coding and Vibe EngineeringIntroduction to Vibe Coding and Vibe Engineering
Introduction to Vibe Coding and Vibe Engineering
Damian T. Gordon
 
Marie Boran Special Collections Librarian Hardiman Library, University of Gal...
Marie Boran Special Collections Librarian Hardiman Library, University of Gal...Marie Boran Special Collections Librarian Hardiman Library, University of Gal...
Marie Boran Special Collections Librarian Hardiman Library, University of Gal...
Library Association of Ireland
 
Odoo Inventory Rules and Routes v17 - Odoo Slides
Odoo Inventory Rules and Routes v17 - Odoo SlidesOdoo Inventory Rules and Routes v17 - Odoo Slides
Odoo Inventory Rules and Routes v17 - Odoo Slides
Celine George
 
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACYUNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
DR.PRISCILLA MARY J
 

1.7 data reduction

  • 2. 2 Data Reduction Strategies  Need for data reduction  A database/data warehouse may store terabytes of data  Complex data analysis/mining may take a very long time to run on the complete data set  Data reduction  Obtain a reduced representation of the data set that is much smaller in volume but yet produce the same (or almost the same) analytical results  Data reduction strategies  Data cube aggregation  Attribute Subset Selection  Numerosity reduction — e.g., fit data into models  Dimensionality reduction - Data Compression  Discretization and concept hierarchy generation
  • 3. 3 2. Attribute Subset Selection  Feature selection (i.e., attribute subset selection):  Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features  reduce # of patterns - easier to understand  Heuristic methods (due to exponential # of choices):  Step-wise forward selection  Step-wise backward elimination  Combining forward selection and backward elimination  Decision-tree induction
  • 4. 4 Example of Decision Tree Induction Initial attribute set: {A1, A2, A3, A4, A5, A6} A4 ? A1? A6? Class 1 Class 2 Class 1 Class 2 > Reduced attribute set: {A1, A4, A6}
  • 5. 5 Heuristic Feature Selection Methods  There are 2d possible sub-features of d features  Several heuristic feature selection methods:  Best single features under the feature independence assumption: choose by significance tests  Best step-wise feature selection  The best single-feature is picked first  Then next best feature condition to the first, ...  Step-wise feature elimination  Repeatedly eliminate the worst feature  Best combined feature selection and elimination
  • 6. 6 3. Numerosity Reduction  Reduce data volume by choosing alternative, smaller forms of data representation  Parametric methods  Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)  Example: Log-linear models, Regression  Non-parametric methods  Do not assume models  Major families: histograms, clustering, sampling
  • 7. 7  Regression  Handles skewed data  Computationally Intensive  Log linear models  Scalability  Estimate the probability of each point in a multi-dimensional space based on a smaller subset  Can construct higher dimensional spaces from lower dimensional ones  Both  Sparse data Regression and Log-Linear Models
  • 8. 8 Histograms  Divide data into buckets and store average (sum) for each bucket  Partitioning rules:  Equal-width: equal bucket range  Equal-frequency (or equal-depth)  V-optimal: with the least histogram variance (weighted sum of the original values that each bucket represents)  MaxDiff: Consider difference between pair of adjacent values. Set bucket boundary between each pair for pairs having the β (No. of buckets)–1 largest differences  Multi-dimensional histogram
  • 9. Histograms 9 List of prices: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30 Histogram with Singleton buckets Equal-width Histogram
  • 10. 10 Clustering  Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only  Can be very effective if data is clustered but not if data is “smeared”  Can have hierarchical clustering and be stored in multi- dimensional index tree structures  Index tree / B+ tree – Hierarchical Histogram 986 3396 5411 8392 9544
  • 11. 11 Sampling  Sampling: obtaining a small sample s to represent the whole data set N  Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data  Choose a representative subset of the data  Simple random sampling may have very poor performance in the presence of skew  Develop adaptive sampling methods  Stratified sampling
  • 12. 12 Sampling Techniques  Simple Random Sample Without Replacement (SRSWOR)  Simple Random Sample With Replacement (SRSWR)  Cluster Sample  Stratified Sample
  • 13. 13 Sampling: with or without Replacement SRSWOR (simple random sample without replacement) SRSWR Raw Data
  • 14. 14 Cluster Sample  Tuples are grouped into M mutually disjoint clusters  SRS of m clusters is taken where m < M  Tuples in a database retrieved in pages  Page - Cluster  SRSWOR to pages
  • 15. 15 Stratified Sample  Data is divided into mutually disjoint parts called strata  SRS at each stratum  Representative samples ensured even in the presence of skewed data
  • 16. 16 Cluster and Stratified Sampling Raw Data
  • 17. 17 Features  Cost depends on size of sample  Sub-linear on size of data  Linear with respect to dimensions  Estimates answer to an aggregate query
  • 18. 18 Data Compression  String compression  There are extensive theories and well-tuned algorithms  Typically lossless  But only limited manipulation is possible  Audio/video compression  Typically lossy compression, with progressive refinement  Sometimes small fragments of signal can be reconstructed without reconstructing the whole
  • 20. 20 Dimensionality Reduction: Wavelet Transformation  Discrete wavelet transform (DWT): linear signal processing  Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients  Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space  Method:  Length, L, must be an integer power of 2 (padding with 0’s, when necessary)  Each transform has 2 functions: smoothing, difference  Applies to pairs of data, resulting in two set of data of length L/2  Applies two functions recursively Haar2 Daubechie4
  • 21. 21 Discrete Wavelet Transform  Can also apply Matrix Multiplication  Fast DWT algorithm  Can be applied to Multi-Dimensional data such as Data Cubes  Wavelet transforms work well on sparse or skewed data  Used in Computer Vision, Compression of finger print images
  • 22. 22  Given N data vectors from k-dimensions, find c ≤ k orthogonal vectors (Principal components) that can be best used to represent data  Steps  Normalize input data: Each attribute falls within the same range  Compute c orthonormal (unit) vectors, i.e., principal components  Each input data (vector) is a linear combination of the c principal component vectors  The principal components are sorted in order of decreasing “significance” or strength  Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance. (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data)  Works for numeric data only  Used for handling sparse data Dimensionality Reduction: Principal Component Analysis (PCA)