Effective and
Unsupervised
Fractal-based
Feature Selection
for Very Large
Datasets
Removing linear and non-linear attribute correlations
Antonio Canabrava Fraideinberze
Jose F Rodrigues-Jr
Robson Leonardo Ferreira Cordeiro
Databases and Images Group
University of São Paulo
São Carlos - SP - Brazil
2
Terabytes?
…
How to analyze that data?
3
Terabytes?
Parallel processing
and dimensionality
reduction, for
sure...
…
How to analyze that data?
How to analyze that data?
4
Terabytes?
, but how to remove
linear and non-linear
attribute correlations,
besides irrelevant
attributes?
…
How to analyze that data?
5
Terabytes?
, and how to reduce
dimensionality without
human supervision,
in a task-independent
way?
…
6
Terabytes?
Curl-Remover
Medium-
dimensionality
…
How to analyze that data?
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
7
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
8
Fundamental Concepts
Fractal Theory
9
Fundamental Concepts
Fractal Theory
10
Fundamental Concepts
Fractal Theory
11
Fundamental Concepts
Fractal Theory
12
Fundamental Concepts
Fractal Theory
Embedded, Intrinsic and Fractal Correlation Dimension
Fractal Correlation Dimension ≅ Intrinsic Dimension
13
Fundamental Concepts
Fractal Theory
Embedded, Intrinsic and Fractal Correlation Dimension
Embedded dimension ≅ 3
Intrinsic dimension ≅ 1
Embedded dimension ≅ 3
Intrinsic dimension ≅ 2
14
Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting
15
Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting
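For reference, the box-counting definition of the Correlation Fractal Dimension D2 used throughout this deck can be stated as follows (standard formulation; the scaling range [r1, r2] is the interval where the log-log plot is linear):

```latex
% Impose grids with cells of side r over the E-dimensional dataset;
% let C_{r,i} be the number of points falling in the i-th cell.
D_2 \;\equiv\; \frac{\partial \,\log \sum_i C_{r,i}^{2}}{\partial \,\log r},
\qquad r \in [r_1, r_2]
```

In practice, D2 is the slope of the line fitted to log Σi C²(r,i) versus log r, which is exactly what the log(r) and log(sum2) quantities reported later in the pipeline are used for.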
16
Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting
[Plot: box-counting log-log curve; x-axis: log(r)]
17
Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting
[Plot: box-counting log-log curve; x-axis: log(r)]
18
Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting
19
Multidimensional
Quad-tree
[Traina Jr. et al, 2000]
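To make box counting concrete, here is a minimal single-machine sketch in Python. It is an illustration only, not the paper's Hadoop/quad-tree implementation; the number of levels and the grid normalization are assumptions.

```python
import numpy as np

def correlation_fractal_dimension(points, levels=8):
    """Estimate D2 via box counting: the slope of log(sum of squared cell
    counts) versus log(cell side r) over successively finer grids."""
    points = np.asarray(points, dtype=float)
    # Normalize every attribute to the unit hypercube [0, 1).
    mins, maxs = points.min(axis=0), points.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    unit = (points - mins) / span * (1.0 - 1e-9)

    log_r, log_s2 = [], []
    for level in range(1, levels + 1):
        r = 0.5 ** level                                # cell side at this level
        cells = np.floor(unit / r).astype(np.int64)     # grid coordinates per point
        _, counts = np.unique(cells, axis=0, return_counts=True)
        log_r.append(np.log(r))
        log_s2.append(np.log((counts.astype(float) ** 2).sum()))

    slope, _ = np.polyfit(log_r, log_s2, 1)             # D2 = slope of the log-log fit
    return slope

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A 1-dimensional manifold (a line) embedded in 3-dimensional space.
    line3d = rng.random((100_000, 1)) * np.array([[1.0, 2.0, -0.5]])
    print(correlation_fractal_dimension(line3d))        # expected: close to 1
```

For a self-similar dataset the log-log curve is a straight line over the scaling range, and its slope is D2.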
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
20
Related Work
Dimensionality Reduction - Taxonomy 1
Dimensionality
Reduction
Supervised Algorithms
Unsupervised
Algorithms
Principal Component
Analysis
Singular Value
Decomposition
Fractal Dimension
Reduction
21
Related Work
Dimensionality Reduction - Taxonomy 2
Dimensionality
Reduction
Feature Extraction
Feature Selection
Principal Component
Analysis
Singular Value
Decomposition
Fractal Dimension
Reduction
Embedded
Filter
Wrapper
22
Related Work
23
Terabytes?
Existing methods need supervision,
miss non-linear correlations, cannot
handle Big Data, or work for
classification only
…
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
24
General Idea
25
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
General Idea
26
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Builds partial trees
for the full dataset
and for its E
(E-1)-dimensional
projections
General Idea
27
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
TreeID
+
cell
spatial
position
Partial
count of
points
General Idea
28
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Sums partial point
counts and reports
log(r) and log(sum2)
for each tree
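The key/value scheme on these slides can be pictured with a tiny, purely illustrative map/reduce simulation in plain Python (the real system is a Hadoop job; the function names and the dictionary-based `trees` argument are hypothetical):

```python
from collections import defaultdict
from math import log

def map_step(points, trees, level):
    """Mapper sketch: for every tree (full attribute set and each
    (E-1)-dimensional projection) emit ((tree_id, cell_position), partial_count)."""
    r = 0.5 ** level                                    # cell side at this level
    partial = defaultdict(int)
    for p in points:                                    # p: point normalized to [0, 1)
        for tree_id, attrs in trees.items():            # trees: {tree_id: attribute indices}
            cell = tuple(int(p[a] / r) for a in attrs)  # cell spatial position
            partial[(tree_id, cell)] += 1               # partial count of points
    return list(partial.items())

def reduce_step(pairs, level):
    """Reducer sketch: sum the partial counts per cell, then report
    log(r) and log(sum of squared counts) for each tree."""
    r = 0.5 ** level
    totals = defaultdict(int)
    for (tree_id, cell), count in pairs:
        totals[(tree_id, cell)] += count
    sum2 = defaultdict(int)
    for (tree_id, _cell), total in totals.items():
        sum2[tree_id] += total * total
    return {tree_id: (log(r), log(s2)) for tree_id, s2 in sum2.items()}
```

Running this for every resolution level r produces, per tree, the points of its log-log plot; fitting their slopes yields D2 for the full dataset and pD2 for each of the E projections.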
General Idea
29
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Computes D2 for
the full dataset and
pD2 for each of its E
(E-1)-dimensional
projections
General Idea
30
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
The least relevant
attribute, i.e., the one
not in the projection
that minimizes
| D2 - pD2 |
General Idea
31
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Spots the second
least relevant
attribute …
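Putting slides 25-31 together, the selection criterion can be summarized by the sequential sketch below. It is illustrative only: `fractal_dimension()` stands for the distributed box-counting pipeline, and whether D2 is recomputed after every removal is an implementation detail the slides do not spell out (the sketch recomputes it).

```python
from math import ceil

def general_idea(data, attributes, fractal_dimension):
    """Sketch of the selection loop: remove the E - ceil(D2) least relevant
    attributes, one at a time, in ascending order of relevance."""
    kept = list(attributes)
    d2_full = fractal_dimension(data, kept)       # D2 of the full dataset
    n_to_remove = len(kept) - ceil(d2_full)       # E - ceil(D2)
    for _ in range(n_to_remove):
        d2 = fractal_dimension(data, kept)        # D2 of the current attribute set
        gaps = {}
        for a in kept:                            # each (len(kept)-1)-dim projection
            projection = [x for x in kept if x != a]
            gaps[a] = abs(d2 - fractal_dimension(data, projection))
        # Least relevant attribute: the one NOT in the projection that
        # minimizes |D2 - pD2|, i.e. whose removal changes D2 the least.
        kept.remove(min(gaps, key=gaps.get))
    return kept
```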
General Idea
3 Main Issues
32
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
General Idea
3 Main Issues
33
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
1st: Too much data to
be shuffled – one
data pair per cell/tree
General Idea
3 Main Issues
34
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
2nd: One data pass
per irrelevant
attribute
General Idea
3 Main Issues
35
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
3rd: Not enough
memory for mappers
Proposed Method
Curl-Remover
36
1st Issue - Too much data to be shuffled; one data pair per cell/tree;
Our solution - Two-phase dimensionality reduction:
a) Serial feature selection on a tiny data sample (one reducer), used only to speed up processing;
b) All mappers project the data into a fixed subspace
37
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Builds/reports N (2 or 3) tree levels of lowest resolution…
Proposed Method
Curl-Remover
38
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
… plus the points
projected into the M (2
or 3) most relevant
attributes of sample
Proposed Method
Curl-Remover
39
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Builds the full trees from
their low resolution level
cells and the projected
points
Proposed Method
Curl-Remover
40
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Proposed Method
Curl-Remover
High resolution cells
are never shuffled
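One illustrative reading of slides 37-40 (a sketch under assumptions, not the paper's exact data structures): each mapper ships only the coarse tree levels plus its points projected onto the M attributes chosen in the sampling phase, so the fine-grained, high-resolution cells are built on the reduce side and never cross the network.

```python
from collections import defaultdict

def mapper_shuffle_payload(points, top_m_attrs, n_low_levels=2):
    """What a mapper reports under the shuffle-reduction scheme:
    (a) partial cell counts for the N coarsest tree levels only, and
    (b) each point projected onto the M most relevant attributes of the sample.
    High-resolution cells are never emitted (and therefore never shuffled)."""
    low_res_cells = defaultdict(int)
    projected_points = []
    for p in points:                                  # p: point normalized to [0, 1)
        for level in range(1, n_low_levels + 1):
            r = 0.5 ** level
            cell = (level,) + tuple(int(x / r) for x in p)
            low_res_cells[cell] += 1                  # coarse levels only
        projected_points.append(tuple(p[a] for a in top_m_attrs))
    return dict(low_res_cells), projected_points
```

The reducers then complete the trees: the shipped low-resolution cells give the top levels directly, and the finer levels are rebuilt locally from the projected points.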
Proposed Method
Curl-Remover
41
2nd Issue - One data pass per irrelevant attribute;
Our solution – Stores/reads the tree level of highest
resolution, instead of the original data.
42
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Rdb = cost to read dataset;
TWRtree = cost to transfer,
write and read the last tree
level in next reduce step;
If (Rdb > TWRtree)
then writes tree;
Proposed Method
Curl-Remover
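The decision on slide 42 is a straightforward cost comparison; a minimal sketch, assuming simple per-byte cost estimates (all parameter names and values below are illustrative, not the paper's cost model):

```python
def should_write_tree(dataset_bytes, last_level_bytes,
                      read_cost_per_byte=1.0, transfer_write_cost_per_byte=1.5):
    """R_db: cost to read the original dataset again in the next iteration.
    TWR_tree: cost to transfer, write and read back the tree's last level.
    Materialize the tree only when re-reading the raw data would cost more."""
    r_db = dataset_bytes * read_cost_per_byte
    twr_tree = last_level_bytes * (transfer_write_cost_per_byte + read_cost_per_byte)
    return r_db > twr_tree   # If (Rdb > TWRtree) then write the tree to HDFS
```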
43
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Proposed Method
Curl-Remover
44
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Writes tree’s last level in
HDFS
Proposed Method
Curl-Remover
45
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Reads tree’s last level
from HDFS
Proposed Method
Curl-Remover
46
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Proposed Method
Curl-Remover
Reads dataset
only twice
Proposed Method
Curl-Remover
47
3rd Issue - Not enough memory for mappers;
Our solution – Sorts data in mappers and reports “tree slices”
whenever needed.
48
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Sorts its local points and
builds “tree slices” while
monitoring memory
consumption
Proposed Method
Curl-Remover
Proposed Method
Curl-Remover
49
[Figure: data points plotted over attributes X and Y]
Proposed Method
Curl-Remover
50
Reports “tree slices”
with very little overlap
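One way to picture the “tree slices” of slides 47-50 (an illustrative sketch; `build_cells()` and the memory proxy are hypothetical): the mapper sorts its local points, grows the counting structure until a memory budget is reached, then flushes the current slice and starts a new one. Because the input is sorted, consecutive slices cover nearly disjoint regions, so they overlap very little.

```python
def emit_tree_slices(points, build_cells, max_cells_in_memory, sort_key=None):
    """Sort local points, accumulate per-cell counts until the slice grows
    past the memory budget, then flush ("report") it and start a new slice."""
    current, flushed = {}, []
    for p in sorted(points, key=sort_key):
        for cell in build_cells(p):                   # cells covering p at every level
            current[cell] = current.get(cell, 0) + 1
        if len(current) >= max_cells_in_memory:       # proxy for memory consumption
            flushed.append(current)                   # report this tree slice
            current = {}
    if current:
        flushed.append(current)
    return flushed
```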
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
51
Evaluation
Datasets
Sierpinski - Sierpinski Triangle + 1 attribute linearly correlated + 2 attributes non-linearly correlated. 5 attributes, 1.1 billion points;
Sierpinski Hybrid - Sierpinski Triangle + 1 attribute non-linearly correlated + 2 random attributes. 5 attributes, 1.1 billion points;
Yahoo! Network Flows - communication patterns between end-users on the web. 12 attributes, 562 million points;
Astro - high-resolution cosmological simulation. 6 attributes, 1 billion points;
Hepmass - physics-related dataset with particles of unknown mass. 28 attributes, 10.5 million points;
Hepmass Duplicated – Hepmass + 28 correlated attributes. 56 attributes, 10.5 million points.
52
Evaluation
Fractal Dimension
Hepmass
53
Evaluation
Fractal Dimension
Hepmass Duplicated
54
Evaluation
Comparison with sPCA - Classification
55
Evaluation
Comparison with sPCA - Classification
56
8% more accurate,
7.5% faster
Evaluation
Comparison with sPCA
Percentage of Fractal Dimension after selection
57
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
58
Conclusions
✓ Accuracy - eliminates both linear and non-linear attribute correlations, besides irrelevant attributes; 8% better than sPCA;
Scalability – linear scalability with the data size (theoretical analysis); experiments with up to 1.1 billion points;
Unsupervised - it does not require the user to guess the number of attributes to be removed, nor does it require a training set;
Semantics - it is a feature selection method, thus maintaining the semantics of the attributes;
Generality - it suits analytical tasks in general, not only classification;
59
Conclusions
✓ Accuracy - eliminates both linear and non-linear attribute correlations, besides irrelevant attributes; 8% better than sPCA;
✓ Scalability – linear scalability with the data size (theoretical analysis); experiments with up to 1.1 billion points;
Unsupervised - it does not require the user to guess the number of attributes to be removed, nor does it require a training set;
Semantics - it is a feature selection method, thus maintaining the semantics of the attributes;
Generality - it suits analytical tasks in general, not only classification;
60
Conclusions
✓ Accuracy - eliminates both linear and non-linear attribute correlations, besides irrelevant attributes; 8% better than sPCA;
✓ Scalability – linear scalability with the data size (theoretical analysis); experiments with up to 1.1 billion points;
✓ Unsupervised - it does not require the user to guess the number of attributes to be removed, nor does it require a training set;
Semantics - it is a feature selection method, thus maintaining the semantics of the attributes;
Generality - it suits analytical tasks in general, not only classification;
61
Conclusions
✓ Accuracy - eliminates both linear and non-linear attribute correlations, besides irrelevant attributes; 8% better than sPCA;
✓ Scalability – linear scalability with the data size (theoretical analysis); experiments with up to 1.1 billion points;
✓ Unsupervised - it does not require the user to guess the number of attributes to be removed, nor does it require a training set;
✓ Semantics - it is a feature selection method, thus maintaining the semantics of the attributes;
Generality - it suits analytical tasks in general, not only classification;
62
Conclusions
✓ Accuracy - eliminates both linear and non-linear attribute correlations, besides irrelevant attributes; 8% better than sPCA;
✓ Scalability – linear scalability with the data size (theoretical analysis); experiments with up to 1.1 billion points;
✓ Unsupervised - it does not require the user to guess the number of attributes to be removed, nor does it require a training set;
✓ Semantics - it is a feature selection method, thus maintaining the semantics of the attributes;
✓ Generality - it suits analytical tasks in general, not only classification;
63
Questions?
robson@icmc.usp.br
[Figure: Hepmass Duplicated]