Distributed Approximate Spectral Clustering for Large-Scale Datasets
FEI GAO, WAEL ABD-ALMAGEED, MOHAMED HEFEEDA
PRESENTED BY : BITA KAZEMI ZAHRANI
Outline
 Introduction
 Terminologies
 Related Works
 Distributed Approximate Spectral Clustering (DASC)
 Analysis
 Evaluations
 Conclusion
 Summary
 References
The idea
 Challenge: high time and space complexity in machine learning algorithms
 Idea: propose a novel kernel-based machine learning algorithm to process large datasets
 Challenge: kernel-based algorithms scale poorly, because the kernel matrix costs O(n^2) time and memory
 Claim: reduced computation and memory with insignificant accuracy loss, plus control over the tradeoff between accuracy and reduction
 Developed a spectral clustering algorithm
 Deployed on the Amazon Elastic MapReduce service
Terminologies
 Kernel-based methods: a class of methods for pattern analysis
 Kernel methods analyze the raw representation of the data through a similarity function over pairs of points, instead of through a feature vector
 In other words, they manipulate the data, or look at it, in another dimension
 Spectral clustering: clustering that uses the eigenvalues (characteristic values) of the similarity matrix
Introduction
 Proliferation of data in all fields
 Use of Machine Learning in science and engineering
 The paper introduces a class of machine learning algorithms that are efficient for large datasets
 Two main techniques:
 Controlled approximation (tradeoff between accuracy and reduction)
 Elastic distribution of computation (a machine learning technique that harnesses the distributed flexibility offered by clouds)
 THE PERFORMANCE OF KERNEL-BASED ALGORITHMS IMPROVES WHEN THE SIZE OF THE TRAINING DATASET INCREASES!
Introduction
 Kernel-based algorithms -> O(n^2) similarity matrix
 Idea: design an approximation algorithm for constructing the kernel matrix for large datasets
 The steps:
I. Design an approximation algorithm for kernel matrix computation that other kernel-based ML algorithms can reuse
II. Design a spectral clustering algorithm on top of it to show the effect on accuracy
III. Implement it with MapReduce and test it on a lab cluster and on EMR
IV. Run experiments on both real and synthetic data
Distributed Approximate Spectral Clustering (DASC)
 Overview
 Details
 MapReduce Implementation
Overview
 Kernel-based algorithms : very popular in the past decade
 Classification
 Dimensionality reduction
 Data clustering
 Compute the kernel matrix, i.e. the kernelized distance between each pair of data points
 1. Create compact signatures for all data points
 2. Group the data points with similar signatures (preprocessing)
 3. Compute similarity values for the data in each group to construct the similarity matrix
 4. Perform spectral clustering on the approximated similarity matrix (SC can be replaced by any other kernel-based approach)
Details
 The first step is preprocessing with Locality-Sensitive Hashing (LSH)
 LSH is a probabilistic dimension reduction technique: hash the input so that similar items map to the same bucket, which reduces the size of the input universe
 The idea: if two points are within a distance threshold R of each other, they receive the same hash with probability at least p1; if they are farther apart than cR (c is the approximation factor), they receive the same hash with probability at most p2 < p1
 Points with a higher collision probability fall in the same bucket
Courtesy: Wikipedia
Details
 LSH : different techniques
 Random projection
 Stable distribution
 ….
 Random projection :
 Choose a random hyperplane
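 Designed to approximate the cosine distance
 $h(v) = \mathrm{sgn}(v \cdot r)$, where $r$ is the normal vector of the hyperplane ($h$ is $+1$ or $-1$)
 $\theta(u, v)$ is the angle between $u$ and $v$
 The collision probability $\Pr[h(u) = h(v)] = 1 - \theta(u, v)/\pi$ serves as an approximation of $\cos(\theta(u, v))$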
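The random-hyperplane hash can be sketched in a few lines. This is a minimal illustration, not code from the paper: the names `sign_hash` and `estimate_angle`, the Gaussian sampling of the hyperplane normal, and the sample count are all assumptions. It relies on the known property that two vectors receive the same sign with probability 1 - theta(u, v)/pi, and inverts that relation to estimate the angle.

```python
import math
import random


def sign_hash(v, r):
    """Return +1 or -1 depending on which side of the hyperplane with normal r the vector v lies."""
    dot = sum(vi * ri for vi, ri in zip(v, r))
    return 1 if dot >= 0 else -1


def estimate_angle(u, v, num_planes=20000, seed=0):
    """Estimate the angle between u and v from how often random hyperplanes hash them alike."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(num_planes):
        # Hypothetical choice: draw the hyperplane normal from a standard Gaussian.
        r = [rng.gauss(0.0, 1.0) for _ in u]
        if sign_hash(u, r) == sign_hash(v, r):
            agree += 1
    # Pr[h(u) = h(v)] = 1 - theta / pi, so theta is about pi * (1 - agreement rate).
    return math.pi * (1.0 - agree / num_planes)


u = [1.0, 0.0]
v = [1.0, 1.0]
theta_est = estimate_angle(u, v)   # should land close to the true angle
theta_true = math.pi / 4           # 45 degrees between u and v
```

The estimate tightens as `num_planes` grows, which is the same reason longer signatures separate points more sharply.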
Courtesy: Wikipedia
Details
 Given N data points
 Random projection hashing produces an M-bit signature for each data point
 Choose an arbitrary dimension
 Compare the feature value with a threshold
 If the feature value in that dimension is larger than the threshold -> 1
 Otherwise -> 0
 M arbitrary dimensions give M bits
 Cost of computing all signatures: O(MN)
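The signature construction just described can be sketched as follows. The slides do not say how the threshold is chosen, so using the mean of the chosen dimension is an assumption made here for illustration, as are the function names and the toy data.

```python
import random


def make_hash_functions(data, m, seed=0):
    """Pick m (dimension, threshold) pairs, one per signature bit."""
    rng = random.Random(seed)
    num_dims = len(data[0])
    funcs = []
    for _ in range(m):
        d = rng.randrange(num_dims)  # arbitrary dimension
        # Assumption: threshold is the mean feature value in that dimension.
        threshold = sum(row[d] for row in data) / len(data)
        funcs.append((d, threshold))
    return funcs


def signature(point, funcs):
    """Build the M-bit signature as an int: a bit is 1 when the feature exceeds its threshold."""
    sig = 0
    for d, threshold in funcs:
        sig = (sig << 1) | (1 if point[d] > threshold else 0)
    return sig


# Toy data: two well-separated groups of two points each.
data = [[0.1, 0.9, 0.2], [0.2, 0.8, 0.1], [0.9, 0.1, 0.7], [0.8, 0.2, 0.9]]
funcs = make_hash_functions(data, m=4)
sigs = [signature(p, funcs) for p in data]  # N points, M comparisons each: O(MN)
```

On this toy data the first two points share one signature and the last two share another, whichever dimensions are drawn, because the groups sit on opposite sides of the mean in every dimension.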
Details
 M-bit signatures
 Near-duplicate signatures (at least P bits in common) go to the same bucket
 Suppose there are T buckets and bucket i holds $N_i$ points, so that $\sum_{i=0}^{T-1} N_i = N$
 The per-bucket similarity matrices then have $\sum_{i=0}^{T-1} N_i^2$ entries in total
 Gaussian similarity is used to compute the similarity between the data points that share a bucket
 Last step: run spectral clustering on the buckets
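A small sketch of the per-bucket similarity step, to make the size reduction concrete. The Gaussian kernel form exp(-||x - y||^2 / (2 sigma^2)), the value of `sigma`, the bucket assignment, and all names are assumptions for illustration; the slides only state that Gaussian similarity is used.

```python
import math


def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def bucket_similarity_matrices(points, buckets, sigma=1.0):
    """Build one N_i x N_i similarity matrix per bucket instead of one N x N matrix."""
    matrices = {}
    for sig, indices in buckets.items():
        matrices[sig] = [
            [gaussian_similarity(points[i], points[j], sigma) for j in indices]
            for i in indices
        ]
    return matrices


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2], [5.1, 5.0]]
# Hypothetical bucket assignment: signature -> indices of the points in that bucket.
buckets = {0b00: [0, 1], 0b11: [2, 3, 4]}
mats = bucket_similarity_matrices(points, buckets)

full_entries = len(points) ** 2                                # 25 entries for the full matrix
approx_entries = sum(len(idx) ** 2 for idx in buckets.values())  # 4 + 9 = 13 entries
```

Entries between points in different buckets are never computed, which is where both the savings and the approximation error come from.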
Details
 Compute the Laplacian matrix for each similarity matrix $S_i$
 The complexity is $\sum_{i=0}^{T-1} N_i^2$
 Compute the first $K_i$ eigenvectors with QR decomposition
 QR: decomposition of a matrix A into a product QR of an orthogonal matrix Q and an upper triangular matrix R, at cost $O(K_i^3)$
 Reduce the Laplacian to a symmetric tridiagonal matrix $A_i$
 QR then becomes $O(K_i)$
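The Laplacian and eigenvector step for a single bucket can be sketched as below. Two things here are assumptions rather than the paper's method: the slides do not say which Laplacian variant is used, so the symmetric normalized form L = I - D^{-1/2} S D^{-1/2} is assumed, and power iteration with deflation is swapped in for the QR-based solver purely to keep the sketch dependency-free. It handles only the two-cluster case, where the sign of the second eigenvector splits the points.

```python
import math


def normalized_affinity(S):
    """Return D^{-1/2} S D^{-1/2}, where D holds the row sums of S."""
    d = [sum(row) for row in S]
    n = len(S)
    return [[S[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]


def laplacian(S):
    """Symmetric normalized Laplacian L = I - D^{-1/2} S D^{-1/2}."""
    A = normalized_affinity(S)
    n = len(S)
    return [[(1.0 if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]


def second_eigenvector(S, iters=200):
    """Second-smallest eigenvector of L, found as the second-largest of the normalized affinity."""
    A = normalized_affinity(S)
    n = len(S)
    d = [sum(row) for row in S]
    # The top eigenvector of A is known in closed form: D^{1/2} 1, normalized.
    norm = math.sqrt(sum(d))
    top = [math.sqrt(di) / norm for di in d]
    v = [float(i + 1) for i in range(n)]  # arbitrary deterministic start vector
    for _ in range(iters):
        # Deflation: remove the component along the top eigenvector.
        proj = sum(vi * ti for vi, ti in zip(v, top))
        v = [vi - proj * ti for vi, ti in zip(v, top)]
        # Power step followed by normalization.
        v = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / length for vi in v]
    return v


# Toy similarity matrix for one bucket: points 0-1 and points 2-3 form two tight groups.
S = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
]
L = laplacian(S)
vec = second_eigenvector(S)
labels = [x > 0 for x in vec]  # sign of the eigenvector separates the two groups
```

For K greater than 2, the usual next step is to stack the first K eigenvectors as columns and run k-means on the rows; that part is omitted here.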
Resulting cost: $O\left(K_i \cdot \sum_{i=0}^{T-1} N_i^2\right)$
Adding up:
$T_{DASC} = O(M \cdot N) + O(T^2) + O\left(K_i \cdot \sum_{i=0}^{T-1} N_i^2\right)$
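To see what the sum over buckets buys, consider the special case of perfectly balanced buckets. This is an illustrative assumption, not a claim from the paper: real bucket sizes depend on the data and on the hash functions. The example sizes are the largest ones quoted in the evaluation slides (512K points, 4K buckets).

```latex
% Assumption: every bucket holds the same number of points, N_i = N / T.
\[
\sum_{i=0}^{T-1} N_i^2 \;=\; T \left(\frac{N}{T}\right)^2 \;=\; \frac{N^2}{T}
\]
% Example: N = 2^{19} (512K points), T = 2^{12} (4K buckets).
\[
N^2 = 2^{38} \approx 2.7 \times 10^{11}
\qquad \text{versus} \qquad
\frac{N^2}{T} = 2^{26} \approx 6.7 \times 10^{7}
\]
```

Under this assumption the number of similarity entries shrinks by a factor of T; unbalanced buckets give a smaller reduction.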
MapReduce Implementation
 Two MapReduce stages:
 1. Apply LSH to the input and compute signatures
 Mapper: (index, inputVector) -> (signature, index)
 Reducer: (signature, index) -> (signature, listOf(index)) [grouped by similarity]
 The parameter P is introduced to control the approximation
 If P bits out of M are identical, put the points in the same bucket
 Compare two M-bit signatures A and B by bit manipulation:
 $Ans = (A \oplus B) \,\&\, ((A \oplus B) - 1)$
 Ans = 0 -> A and B differ in at most one bit; the check is O(1)
 2. Standard MapReduce implementation of spectral clustering from the Mahout library
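The first stage can be simulated in plain Python to show how the mapper, the reducer, and the bit trick fit together. This is a sketch under stated assumptions, not the Hadoop or Mahout code: `threshold_signature` is a hypothetical stand-in for the LSH step, and merging each signature into the first bucket whose representative differs in at most one bit is one simple reading of "P = M - 1", since the slides do not spell out the merge rule.

```python
def differ_in_at_most_one_bit(a, b):
    """True when signatures a and b are identical or differ in exactly one bit."""
    x = a ^ b                      # bits where a and b disagree
    return (x & (x - 1)) == 0      # clearing the lowest set bit leaves zero


def threshold_signature(vec, threshold=0.5):
    """Hypothetical signature: one bit per coordinate, 1 when the value exceeds the threshold."""
    sig = 0
    for value in vec:
        sig = (sig << 1) | (1 if value > threshold else 0)
    return sig


def map_stage(records, sign_fn):
    """Mapper: (index, inputVector) -> (signature, index)."""
    return [(sign_fn(vec), idx) for idx, vec in records]


def reduce_stage(pairs):
    """Reducer: group indices, merging signatures that differ in at most one bit."""
    buckets = []  # list of (representative signature, member indices)
    for sig, idx in sorted(pairs):
        for rep, members in buckets:
            if differ_in_at_most_one_bit(rep, sig):
                members.append(idx)
                break
        else:
            buckets.append((sig, [idx]))
    return buckets


records = [
    (0, [0.1, 0.2, 0.9]),  # signature 0b001
    (1, [0.2, 0.1, 0.8]),  # signature 0b001
    (2, [0.1, 0.7, 0.9]),  # signature 0b011, one bit away from 0b001
    (3, [0.9, 0.9, 0.1]),  # signature 0b110, three bits away from 0b001
]
buckets = reduce_stage(map_stage(records, threshold_signature))
# buckets == [(1, [0, 1, 2]), (6, [3])]
```

The XOR isolates the differing bits and `x & (x - 1)` clears the lowest one, so the result is zero exactly when at most one bit differs.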
Courtesy: reference [1]
Analysis and Complexity
 Complexity Analysis
 Accuracy Analysis
Complexity Analysis
 Computation analysis
 Memory analysis
Courtesy: reference [1]
Accuracy analysis
 As the number of hash functions increases, the probability that adjacent points are clustered into the same bucket decreases
 The number of incorrectly clustered points increases
 As M increases, the collision probability decreases
 More hash functions give more buckets, hence more parallelism
 The value of M is used to partition the data
 Tuning M controls the tradeoff between accuracy and reduction
Courtesy: reference [1]
Accuracy Analysis
 Randomly selected documents
 Accuracy: ratio of correctly clustered points to the total number of points
 DASC better than PSC, close to SC
 NYST run in MATLAB [not on a cluster]
 Clustering didn’t affect accuracy too much
Courtesy: reference [1]
Experimental Evaluations
 Platform
 Datasets
 Performance Metrics
 Algorithms Implemented and Compared Against
 Results for Clustering
 Results for Time and Space Complexities
 Elasticity and Scalability
Platform
 Hadoop MapReduce Framework
 Local cluster and Amazon EMR
 Local : 5 machines (1 master/ 4 slaves)
 Intel Core2 Duo E6550
 1GB DRAM
 Hadoop 0.20.2 on Ubuntu, Java 1.6.0
 EMR: variable configurations of 16, 32 and 64 machines
 1.7GB Memory, 350 GB disk
 Hadoop 0.20.2 on Debian Linux
Courtesy: reference [1]
Dataset
 Synthetic
 Real : Wikipedia
Courtesy: reference [1]
Job flow on EMR
 Job Flow: collection of steps on a dataset
 First partition dataset into buckets with LSH
 Each bucket stored in a file
 SC applied to each bucket
 Partitions are processed based on mappers available
 Data-dependent hashing can be used instead of LSH
 Locality is achieved through the first step (LSH)
 Highly scalable: if nodes are added or removed, the system dynamically allocates fewer or more partitions to each node
Algorithms and Performance Metrics
 Computational and space complexity
 Clustering accuracy
 Ratio of correctly clustered points to the total number of points
 DBI (Davies-Bouldin Index) and ASE (average squared error)
 F-norm (Frobenius norm) to measure the similarity of the original and approximate kernel matrices
 Implemented DASC, PSC, SC and NYST
 DASC: modified the Mahout library, with M = ⌊(log N)/2 − 1⌋ and P = M − 1 to allow an O(1) comparison
 SC: Mahout implementation
 PSC: C++ implementation of SC using the PARPACK library
 NYST: implemented in MATLAB
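Two of the metrics listed above are simple enough to sketch directly. The definitions here are assumptions for illustration: the F-norm ratio is taken as the Frobenius norm of the approximate kernel matrix divided by that of the original, and the accuracy function assumes predicted cluster labels have already been aligned with the ground-truth labels.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def fnorm_ratio(approx, original):
    """Assumed definition: ||approx||_F / ||original||_F, at most 1 when entries are only dropped."""
    return frobenius_norm(approx) / frobenius_norm(original)


def clustering_accuracy(predicted, truth):
    """Fraction of points whose (already aligned) cluster label matches the ground truth."""
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)
    return correct / len(truth)


# Two points placed in different buckets: the cross-bucket similarity 0.5 is dropped.
original = [[1.0, 0.5], [0.5, 1.0]]
approx = [[1.0, 0.0], [0.0, 1.0]]
ratio = fnorm_ratio(approx, original)                    # sqrt(2 / 2.5) = sqrt(0.8)
acc = clustering_accuracy([0, 0, 1, 1], [0, 0, 1, 0])    # 3 of 4 correct = 0.75
```

A ratio close to 1 means little similarity mass was discarded by the bucketing; it falls as more cross-bucket entries are zeroed out.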
Results for Clustering Accuracy
 Similarity between the approximate and original Gram matrices
 Measured with the low-level F-norm (Frobenius norm)
 4K to 512K data points
 Could not use more than 512K points
 The memory needed is the square of the input size
 Different numbers of buckets: 4 to 4K
 The larger the number of buckets, the more partitioning
 and the lower the accuracy
 Increasing the number of buckets decreases the F-norm ratio: the matrices become less similar
 The number of buckets can be increased up to the point where the F-norm ratio starts to drop
Courtesy: reference [1]
Results for Time and Space Complexities
 DASC, PSC and SC comparison:
 PSC implemented in C++ with MPI
 DASC implemented in Java with the MapReduce framework
 SC implemented with the Mahout library
 DASC runs one order of magnitude faster than PSC
 Mahout SC didn’t scale to more than 15
Courtesy: reference [1]
Conclusion
 Kernel-based Machine Learning
 Approximation for computing kernel matrix
 LSH to reduce the kernel computation
 DASC: O(B) reduction in the size of the problem (B is the number of buckets)
 B depends on the number of bits in the signature
 Tested on real and synthetic datasets (real: Wikipedia)
 90% accuracy
Critique
 The approximation factor depends on the size of the problem
 The algorithm applies only to kernel-based machine learning methods, which are themselves applicable to only a limited number of datasets
 The compared algorithms are each implemented in a different framework (e.g. the MPI implementation might be really inefficient), so the comparison is not really fair
References
 [1] Hefeeda, Mohamed, Fei Gao, and Wael Abd-Almageed. "Distributed approximate spectral clustering for large-scale datasets." Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2012.
 Wikipedia
 Tom M. Mitchell slides: https://www.cs.cmu.edu/~tom/10701_sp11/slides
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
Time series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptxTime series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptx
AsmaaMahmoud89
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
Lesson-2.pptxjsjahajauahahagqiqhwjwjahaiq
Lesson-2.pptxjsjahajauahahagqiqhwjwjahaiqLesson-2.pptxjsjahajauahahagqiqhwjwjahaiq
Lesson-2.pptxjsjahajauahahagqiqhwjwjahaiq
AngelPinedaTaguinod
 
Lesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdfLesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdf
hemelali11
 
Urban models for professional practice 03
Urban models for professional practice 03Urban models for professional practice 03
Urban models for professional practice 03
DanisseLoiDapdap
 
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm     mmmmmfftro.pptxlecture_13 tree in mmmmmmmm     mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
sarajafffri058
 
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual FormStorage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Professional Content Writing's
 
Digital Disruption Use Case_Music Industry_for students.pdf
Digital Disruption Use Case_Music Industry_for students.pdfDigital Disruption Use Case_Music Industry_for students.pdf
Digital Disruption Use Case_Music Industry_for students.pdf
ProsenjitMitra9
 
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptxConcrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
ssuserd1f4a3
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
英国学位证(利物浦约翰摩尔斯大学本科毕业证)LJMU文凭证书办理
英国学位证(利物浦约翰摩尔斯大学本科毕业证)LJMU文凭证书办理英国学位证(利物浦约翰摩尔斯大学本科毕业证)LJMU文凭证书办理
英国学位证(利物浦约翰摩尔斯大学本科毕业证)LJMU文凭证书办理
Taqyea
 
web-roadmap developer file information..
web-roadmap developer file information..web-roadmap developer file information..
web-roadmap developer file information..
pandeyarush01
 

Distributed approximate spectral clustering for large scale datasets

  • 1. Distributed Approximate Spectral Clustering for Large-Scale Datasets
      - Fei Gao, Wael Abd-Almageed, Mohamed Hefeeda
      - Presented by: Bita Kazemi Zahrani
  • 2. Outline
      - Introduction
      - Terminologies
      - Related Works
      - Distributed Approximate Spectral Clustering (DASC)
      - Analysis
      - Evaluations
      - Conclusion
      - Summary
      - References
  • 3. The idea
      - Challenge: high time and space complexity in machine learning algorithms
      - Idea: propose a novel kernel-based machine learning algorithm to process large datasets
      - Challenge: kernel-based algorithms scale poorly, with O(n^2) cost
      - Claim: reduced computation and memory with insignificant accuracy loss, plus control over the tradeoff between accuracy and reduction
      - Developed a spectral clustering algorithm
      - Deployed on the Amazon Elastic MapReduce service
  • 4. Terminologies
      - Kernel-based methods: one family of methods for pattern analysis
      - They analyze the raw representation of the data through a similarity function over pairs of points instead of through a feature vector
      - In other words: manipulating the data, or looking at it from another dimension
      - Spectral clustering: clustering that uses the eigenvalues (characteristic values) of the similarity matrix
  • 5. Introduction
      - Proliferation of data in all fields
      - Use of machine learning in science and engineering
      - The paper introduces a class of machine learning algorithms that are efficient for large datasets
      - Two main techniques:
          - Controlled approximation (tradeoff between accuracy and reduction)
          - Elastic distribution of computation (a machine learning technique harnessing the distributed flexibility offered by clouds)
      - The performance of kernel-based algorithms improves as the size of the training dataset increases!
  • 6. Introduction
      - Kernel-based algorithms require an O(n^2) similarity matrix
      - Idea: design an approximation algorithm for constructing the kernel matrix for large datasets
      - The steps:
          - I. Design an approximation algorithm for kernel matrix computation, scalable to all ML algorithms
          - II. Design a spectral clustering algorithm on top of it to show the effect on accuracy
          - III. Implement it with MapReduce and test it on a lab cluster and on EMR
          - IV. Run experiments on both real and synthetic data
  • 7. Distributed Approximate Spectral Clustering (DASC)
      - Overview
      - Details
      - MapReduce implementation
  • 8. Overview
      - Kernel-based algorithms: very popular in the past decade
          - Classification
          - Dimensionality reduction
          - Data clustering
      - They compute a kernel matrix, i.e. the kernelized distance between each pair of data points
      - 1. Create compact signatures for all data points
      - 2. Group the data points with similar signatures (preprocessing)
      - 3. Compute similarity values for the data in each group to construct the similarity matrix
      - 4. Perform spectral clustering on the approximated similarity matrix (SC can be replaced by any other kernel-based approach)
  • 9. Details
      - The first step is preprocessing with Locality Sensitive Hashing (LSH)
      - LSH is a probabilistic dimension reduction technique: hash the input, map similar items to the same bucket, and so reduce the size of the input universe
      - The idea: if the distance between two points is below a threshold R, they get the same hash with probability at least p1; if their distance is above cR, where c is the approximation factor, they get the same hash with probability at most p2 (with p1 > p2)
      - Points with a higher collision probability fall in the same bucket!
      - Courtesy: Wikipedia
  • 10. Details
      - LSH: different techniques
          - Random projection
          - Stable distributions
          - ...
      - Random projection:
          - Choose a random hyperplane
          - Designed to approximate the cosine distance
          - $h(v) = \mathrm{sgn}(v \cdot r)$, where r is the normal vector of the hyperplane (h is + or -)
          - $\theta(u, v)$ is the angle between u and v
          - The collision probability $\Pr[h(u) = h(v)] = 1 - \theta(u, v)/\pi$ is an approximation of $\cos(\theta(u, v))$
      - Courtesy: Wikipedia
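The random projection hash on this slide can be tried out numerically. The sketch below is only an illustration, not the paper's code: the helper names (`random_hyperplane`, `hash_bit`, `estimate_collision_rate`) are made up here, and drawing the hyperplane normal from a standard Gaussian is an assumption. It compares the observed collision rate with the known value 1 - theta(u, v)/pi.

```python
import math
import random

def random_hyperplane(dim, rng):
    """Draw the normal vector r of a random hyperplane through the origin."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def hash_bit(v, r):
    """h(v) = sgn(v . r), reported as +1 or -1 (a zero dot product maps to +1)."""
    dot = sum(vi * ri for vi, ri in zip(v, r))
    return 1 if dot >= 0 else -1

def angle(u, v):
    """Angle theta(u, v) between two non-zero vectors, in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))

def estimate_collision_rate(u, v, trials=20000, seed=0):
    """Fraction of random hyperplanes that put u and v on the same side."""
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        r = random_hyperplane(len(u), rng)
        if hash_bit(u, r) == hash_bit(v, r):
            same += 1
    return same / trials

u = [1.0, 0.0]
v = [1.0, 1.0]                            # 45 degrees away from u
expected = 1.0 - angle(u, v) / math.pi    # 0.75 for a 45-degree angle
observed = estimate_collision_rate(u, v)
```

With enough trials, `observed` should land close to `expected`, which is what makes the sign of a random projection usable as a similarity-preserving hash bit.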
  • 11. Details
      - Given N data points
      - Random projection hashing produces an M-bit signature for each data point
          - Choose an arbitrary dimension
          - Compare the feature value with a threshold
          - If the feature value in that dimension is larger than the threshold -> 1
          - Otherwise -> 0
          - M arbitrary dimensions give M bits
      - Cost: O(MN)
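A minimal sketch of the signature step described above. The slide does not say how dimensions or thresholds are chosen, so two assumptions are made here: dimensions are sampled at random without replacement, and the threshold for a dimension is the mean of that feature over all points. All function names are hypothetical.

```python
import random

def choose_dimensions(num_features, m, seed=0):
    """Pick M feature dimensions at random (assumption: without replacement)."""
    rng = random.Random(seed)
    return rng.sample(range(num_features), m)

def mean_thresholds(points, dims):
    """Assumed threshold rule: the mean of each chosen dimension."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in dims]

def signature(point, dims, thresholds):
    """Pack M comparison bits into one integer, first dimension as the top bit."""
    sig = 0
    for d, t in zip(dims, thresholds):
        sig = (sig << 1) | (1 if point[d] > t else 0)
    return sig

points = [
    [0.1, 0.9, 0.2, 0.8],
    [0.2, 0.8, 0.1, 0.9],
    [0.9, 0.1, 0.8, 0.2],
    [0.8, 0.2, 0.9, 0.1],
]
dims = choose_dimensions(num_features=4, m=3)
thresholds = mean_thresholds(points, dims)
# One comparison per point and per chosen dimension: N * M comparisons in total.
signatures = [signature(p, dims, thresholds) for p in points]
```

In this toy dataset the first two points are close to each other and so are the last two, so each pair ends up with an identical signature, while the two pairs differ in every bit.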
  • 12. Details
      - M-bit signatures
      - Near-duplicate signatures (at least P bits in common) go to the same bucket
      - Suppose there are T buckets and bucket i holds $N_i$ points, so that $\sum_{i=0}^{T-1} N_i = N$
      - Computing the similarity values then costs $\sum_{i=0}^{T-1} N_i^2$
      - Gaussian similarity is used to compute the similarity values within each bucket
      - Last step: run spectral clustering on the buckets
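To make the saving concrete, the sketch below builds one Gaussian similarity matrix per bucket and counts the entries computed, compared with the full N x N matrix. It assumes the usual Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) with a free bandwidth `sigma`; the bucket contents and signature keys are invented for the example.

```python
import math

def gaussian_similarity(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)); sigma is a free bandwidth parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def bucket_similarity_matrices(points, buckets, sigma=1.0):
    """Build one N_i x N_i similarity matrix per bucket of point indices."""
    matrices = {}
    for key, indices in buckets.items():
        matrices[key] = [
            [gaussian_similarity(points[i], points[j], sigma) for j in indices]
            for i in indices
        ]
    return matrices

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
buckets = {0b01: [0, 1], 0b10: [2, 3, 4]}   # hypothetical signature -> point indices

matrices = bucket_similarity_matrices(points, buckets)
entries_bucketed = sum(len(idx) ** 2 for idx in buckets.values())  # 2^2 + 3^2 = 13
entries_full = len(points) ** 2                                    # 5^2 = 25
```

With T equally sized buckets the bucketed count is N^2 / T, so the saving grows with the number of buckets, at the price of ignoring all cross-bucket similarities.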
  • 13. Details
      - Compute the Laplacian matrix for each similarity matrix $S_i$
          - The complexity is $\sum_{i=0}^{T-1} N_i^2$
      - Compute the first K eigenvectors with QR decomposition
          - QR: decomposition of a matrix A into a product QR of an orthogonal matrix Q and an upper triangular matrix R, $O(K_i^3)$
          - Reducing the Laplacian to a symmetric tridiagonal matrix $A_i$ brings QR down to $O(K_i)$
          - Cost: $O(K_i \cdot \sum_{i=0}^{T-1} N_i^2)$
      - Adding up: $T_{DASC} = O(M \cdot N) + O(T^2) + O(K_i \cdot \sum_{i=0}^{T-1} N_i^2)$
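The slide does not say which Laplacian variant is used. As an illustration only, the sketch below assumes the symmetric normalized Laplacian I - D^{-1/2} S D^{-1/2}, which is common in spectral clustering, and applies it to one small, made-up bucket matrix. The eigenvector step is left out; in practice it would be handed to a linear algebra library.

```python
import math

def normalized_laplacian(S):
    """Return I - D^{-1/2} S D^{-1/2} for a symmetric similarity matrix S."""
    n = len(S)
    degrees = [sum(row) for row in S]                 # diagonal of D
    inv_sqrt = [1.0 / math.sqrt(d) for d in degrees]
    return [
        [(1.0 if i == j else 0.0) - inv_sqrt[i] * S[i][j] * inv_sqrt[j]
         for j in range(n)]
        for i in range(n)
    ]

# Hypothetical similarity matrix S_i for a bucket with three points.
S_i = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
L_i = normalized_laplacian(S_i)

# Sanity check: D^{1/2} * 1 is an eigenvector of L_i with eigenvalue 0,
# so multiplying L_i by it should give (numerically) the zero vector.
sqrt_deg = [math.sqrt(sum(row)) for row in S_i]
residual = [sum(L_i[r][c] * sqrt_deg[c] for c in range(3)) for r in range(3)]
```

Building the Laplacian touches every entry of $S_i$ once, which matches the $N_i^2$ term per bucket in the complexity above.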
  • 14. MapReduce Implementation
      - Two MapReduce stages:
      - 1. Apply LSH to the input and compute signatures
          - Mapper: (index, inputVector) -> (signature, index)
          - Reducer: (signature, index) -> (signature, listOf(index)) [based on similarity]
          - P is introduced to restrict the approximation
          - If P bits out of M are the same, put the points in the same bucket
          - Compare two M-bit signatures A and B by bit manipulation:
          - $\mathit{Ans} = (A \oplus B) \mathbin{\&} ((A \oplus B) - 1)$
          - Ans = 0 -> A and B differ in at most one bit; the check is O(1)
      - 2. Standard MapReduce implementation of spectral clustering from the Mahout library
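The first stage can be simulated in plain Python without Hadoop. The sketch below is a single-process stand-in, not the authors' MapReduce code: `map_stage`, `reduce_stage` and `bits_to_int` are hypothetical names, and the greedy merge of near-duplicate signatures in the reducer is an assumption about how "based on similarity" could be realized. The bit test is the one from the slide: x & (x - 1) clears the lowest set bit of x, so it is zero exactly when x = A xor B has at most one bit set.

```python
from collections import defaultdict

def differ_in_at_most_one_bit(a, b):
    """True when signatures a and b are equal or differ in exactly one bit."""
    x = a ^ b
    return (x & (x - 1)) == 0

def bits_to_int(bits):
    """Pack a list of 0/1 values into an integer signature."""
    sig = 0
    for b in bits:
        sig = (sig << 1) | b
    return sig

def map_stage(records, make_signature):
    """Mapper: (index, inputVector) -> (signature, index)."""
    return [(make_signature(vector), index) for index, vector in records]

def reduce_stage(pairs):
    """Reducer: (signature, index) -> (signature, list of index).

    Greedy single pass (an assumption): each signature joins the first bucket
    whose representative differs in at most one bit, else opens a new bucket.
    """
    buckets = defaultdict(list)
    for sig, index in sorted(pairs):
        for rep in buckets:
            if differ_in_at_most_one_bit(sig, rep):
                buckets[rep].append(index)
                break
        else:
            buckets[sig].append(index)
    return dict(buckets)

# The input vectors here are already bit vectors, purely to keep the demo short.
records = [(0, [1, 0, 1, 1]), (1, [1, 0, 1, 0]), (2, [0, 1, 0, 0]), (3, [0, 1, 0, 0])]
buckets = reduce_stage(map_stage(records, bits_to_int))
```

Here M = 4 and P = M - 1 = 3: points 0 and 1 share three of four bits and land in one bucket, while points 2 and 3 have identical signatures and form the other.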
  • 16. Analysis and Complexity
      - Complexity analysis
      - Accuracy analysis
  • 17. Complexity Analysis
      - Computation analysis
      - Memory analysis
      - Figures courtesy of reference [1]
  • 18. Accuracy Analysis
      - As the number of hash functions increases, the probability that adjacent points are clustered into the same bucket decreases
      - So incorrect clustering increases
      - Increasing M decreases the collision probability
      - More hash functions: more parallelism
      - The value of M determines the partitioning
      - Tuning M controls the tradeoff
      - Courtesy: reference [1]
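The effect of M on the collision probability can be seen with a small calculation. This is a simplified model, not the paper's analysis: it assumes that the M signature bits are independent and that two nearby points agree on any single bit with the same probability `p` (the value 0.9 is made up).

```python
def prob_all_bits_match(p, m):
    """Probability that all M bits agree, given per-bit agreement probability p."""
    return p ** m

def prob_at_least_m_minus_1_match(p, m):
    """Probability that at least M-1 of M bits agree (the P = M-1 setting)."""
    return p ** m + m * (p ** (m - 1)) * (1.0 - p)

p = 0.9   # assumed per-bit agreement probability for two nearby points
table = [(m, prob_all_bits_match(p, m), prob_at_least_m_minus_1_match(p, m))
         for m in (4, 8, 16)]
```

Both columns of `table` shrink as M grows, which is the tradeoff on this slide: more bits give more, smaller buckets and more parallelism, but nearby points are more likely to be separated. Allowing one differing bit (P = M - 1) softens the drop.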
  • 19. Accuracy Analysis
      - Randomly selected documents
      - Ratio: correctly classified points to the total number of points
      - DASC is better than PSC and close to SC
      - NYST was run in MATLAB (not on a cluster)
      - Clustering didn't affect accuracy too much
      - Courtesy: reference [1]
  • 20. Experimental Evaluations
      - Platform
      - Datasets
      - Performance metrics
      - Algorithms implemented and compared against
      - Results for clustering
      - Results for time and space complexities
      - Elasticity and scalability
  • 21. Platform
      - Hadoop MapReduce framework
      - Local cluster and Amazon EMR
      - Local: 5 machines (1 master / 4 slaves)
          - Intel Core2 Duo E6550
          - 1 GB DRAM
          - Hadoop 0.20.2 on Ubuntu, Java 1.6.0
      - EMR: variable configurations of 16, 32 and 64 machines
          - 1.7 GB memory, 350 GB disk
          - Hadoop 0.20.2 on Debian Linux
      - Courtesy: reference [1]
  • 22. Dataset
      - Synthetic
      - Real: Wikipedia
      - Courtesy: reference [1]
  • 23. Job flow on EMR
      - Job flow: a collection of steps on a dataset
      - First, partition the dataset into buckets with LSH
      - Each bucket is stored in a file
      - SC is applied to each bucket
      - Partitions are processed according to the mappers available
      - Data-dependent hashing can be used instead of LSH
      - Locality is achieved through the first step (LSH)
      - Highly scalable: as nodes are added or removed, fewer or more partitions are dynamically allocated to each node
  • 24. Algorithms and Performance Metrics
      - Computational and space complexity
      - Clustering accuracy
          - Ratio of correctly clustered points to the total number of points
          - DBI (Davies-Bouldin index) and ASE (average squared error)
          - F-norm (Frobenius norm) to measure the similarity between the original and approximate kernel matrices
      - Implemented DASC, PSC, SC and NYST
          - DASC: modified Mahout library, with M = (log N)/2 − 1 (rounded to an integer) and P = M − 1 so that signature comparison is O(1)
          - SC: Mahout
          - PSC: C++ implementation of SC using the PARPACK library
          - NYST: implemented in MATLAB
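Two of the metrics listed above are easy to state in code. The sketch below uses the standard definition of the Davies-Bouldin index with Euclidean distances (lower is better); the accuracy helper assumes the predicted cluster labels have already been aligned with the ground-truth labels, which the slides do not discuss. Function names and sample data are illustrative.

```python
import math

def accuracy(predicted, truth):
    """Ratio of correctly clustered points to the total number of points."""
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)
    return correct / len(truth)

def centroid(cluster):
    """Component-wise mean of the points in one cluster."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

def davies_bouldin_index(clusters):
    """Mean over clusters of the worst (s_i + s_j) / d(c_i, c_j) ratio."""
    centroids = [centroid(c) for c in clusters]
    # s_i: average distance of the points in cluster i to its centroid.
    scatter = [sum(math.dist(p, c) for p in pts) / len(pts)
               for pts, c in zip(clusters, centroids)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k

clusters = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
dbi = davies_bouldin_index(clusters)            # scatter 1 each, centroids 10 apart
acc = accuracy([0, 0, 1, 1], [0, 0, 1, 0])      # 3 of 4 labels agree
```

The index needs at least two clusters with distinct centroids; `math.dist` requires Python 3.8 or later.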
  • 25. Results for Clustering Accuracy
      - Similarity between the approximate and original Gram matrices
          - Using the low-level F-norm
      - 4K to 512K data points
          - Not able to use more than 512K points
          - The full matrix needs memory proportional to the square of the input size
      - Different numbers of buckets, from 4 to 4K
      - The larger the number of buckets, the more partitioning
          - Lower accuracy
      - Increasing the number of buckets decreases the F-norm ratio: the matrices are less similar
      - The number of buckets can be increased for a while before the F-norm ratio drops
      - Courtesy: reference [1]
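A toy version of this comparison is sketched below. It assumes the F-norm ratio is the Frobenius norm of the approximate Gram matrix divided by that of the full Gram matrix, and that the approximation simply zeroes every cross-bucket entry; the slides do not spell out either detail. The points and bucket assignments are made up.

```python
import math

def gaussian(x, y, sigma=1.0):
    """Gaussian kernel value for two points."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

def approximate_gram(points, bucket_of, sigma=1.0):
    """Keep kernel values for same-bucket pairs, zero everywhere else."""
    n = len(points)
    return [[gaussian(points[i], points[j], sigma) if bucket_of[i] == bucket_of[j] else 0.0
             for j in range(n)]
            for i in range(n)]

def fnorm_ratio(points, bucket_of):
    """Assumed definition: ||approximate||_F / ||full||_F."""
    full = approximate_gram(points, [0] * len(points))   # one bucket = full matrix
    return frobenius_norm(approximate_gram(points, bucket_of)) / frobenius_norm(full)

points = [[0.0], [0.5], [4.0], [4.5]]
ratio_1 = fnorm_ratio(points, [0, 0, 0, 0])   # single bucket: no approximation
ratio_2 = fnorm_ratio(points, [0, 0, 1, 1])   # buckets follow the natural groups
ratio_4 = fnorm_ratio(points, [0, 1, 2, 3])   # every point alone in its bucket
```

Splitting along the natural groups barely changes the ratio because the dropped cross-group similarities are tiny, whereas over-partitioning discards large entries and the ratio falls noticeably.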
  • 26. Results for Time and Space Complexities
      - DASC, PSC and SC comparison:
          - PSC implemented in C++ with MPI
          - DASC implemented in Java with the MapReduce framework
          - SC implemented with the Mahout library
      - DASC runs one order of magnitude faster than PSC
      - Mahout SC didn’t scale to more than 15
      - Courtesy: reference [1]
  • 27. Conclusion
      - Kernel-based machine learning
      - Approximation for computing the kernel matrix
      - LSH to reduce the kernel computation
      - DASC gives an O(B) reduction in the size of the problem (B: number of buckets)
      - B depends on the number of bits in the signature
      - Tested on real and synthetic datasets (real: Wikipedia)
      - 90% accuracy
  • 28. Critique
      - The approximation factor depends on the size of the problem
      - The approach applies only to kernel-based machine learning methods, which are applicable to only a limited number of datasets
      - The compared algorithms are each implemented in a different framework (e.g. the MPI implementation might be really inefficient), so the comparison is not really fair!
  • 29. References
      - [1] Hefeeda, Mohamed, Fei Gao, and Wael Abd-Almageed. "Distributed approximate spectral clustering for large-scale datasets." Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2012.
      - [2] Wikipedia
      - [3] Tom M. Mitchell, slides: https://www.cs.cmu.edu/~tom/10701_sp11/slides