Distributed approximate spectral clustering for large scale datasets

Distributed
Approximate Spectral
Clustering for Large-
Scale Datasets
FEI GAO, WAEL ABD-ALMAGEED, MOHAMED HEFEEDA
PRESENTED BY : BITA KAZEMI ZAHRANI
1

Outline
 Introduction
 Terminologies
 Related Works
 Distributed Approximate Spectral Clustering (DASC)
 Analysis
 Evaluations
 Conclusion
 Summary
 References
2

The idea ….
 Challenge : High time and Space complexity in Machine Learning
algorithms
 Idea : propose a novel kernel-based machine learning algorithm to
process large datasets
 Challenge : kernel-based algorithms suffer from scalability O(n2)
 Claim : reduction in computation and memory with insignificant
accuracy loss and providing a control over accuracy and reduction
tradeoff
 Developed a spectral clustering algorithm
 Deployed on Amazon Elastic MapReduce service
3

Terminologies
 Kernel based methods : one of the methods for pattern analysis,
 Based on kernel methods : analyzing raw representation of data by
a similarity function over pairs instead of a feature vector
 In another word manipulating data or looking at data from another
dimension
 Spectral clustering : using eigenvalues( characteristic values) of
similarity matrix
4

Introduction
 Proliferation of data in all fields
 Use of Machine Learning in science and engineering
 The paper introduces the class of Machine Learning Algorithms
efficient for large datasets
 Two main techniques :
 Controlled Approximation ( tradeoff between accuracy and reduction)
 Elastic distribution of computation ( machine learning technique
harnessing the distributed flexibility offered by clouds)
 THE PERFORMANCE OF KERNEL-BASED ALGORITHMS IMPROVES
WHEN SIZE OF TRAINING DATASET INCREASES!
5

Introduction
 Kernel-based algorithms -> O(n2) similarity matrix
 Idea : design an approximation algorithm for constructing the kernel
matrix for large datasets
 The steps :
I. Design an approximation algorithm for kernel matrix computation
scalable to all ML algorithms
II. Design a Spectral Classification on top of that to show the effect
on accuracy
III. Implemented with MapReduce tested on lab cluster and EMR
IV. Experiments on both real and synthetic data
6

Distributed Approximate Spectral
Clustering (DASC)
 Overview
 Details
 Mapreduce Implementation
7

Overview
 Kernel-based algorithms : very popular in the past decade
 Classification
 Dimensionality reduction
 Data clustering
 Compute kernel matrix i.e kernelized distance between data pair
 1. create compact signatures for all data points
 2. group the data with similar signature ( preprocessing)
 3. compute similarity values for data in each group to construct similarity
matrix
 4. perform Spectral Clustering on the appx’s similarity matrix (SC can be
replaces by any other kernel base approach)
8

Detail
 The first step is preprocessing with Locality Sensitive Hashing(LSH)
 LSH -> probability reduction technique
 Dimension reduction technique : hash input, map similar items to the
same bucket, reduce the size of input universe
The idea : if distance of two points is less than a
Threshold with probability p1, and distance of those
Points is higher than the threshold by appx factor of c
Then group them together or output the same hash!!
Points with higher collision fall in the same bucket!
9
Courtesy: Wikipedia

Details
 LSH : different techniques
 Random projection
 Stable distribution
 ….
 Random projection :
 Choose a random hyperplane
 Designed to appx the cosine distance
 ℎ 𝜈 = 𝑠𝑔𝑛(𝜈. 𝑟) when r is the hyperplane ( h is + or -)
 𝜃 𝑢, 𝑣 𝑡ℎ𝑒 𝑎𝑛𝑔𝑙𝑒 𝑏𝑒𝑡𝑤𝑒𝑒𝑛 𝑢 𝑎𝑛𝑑 𝑣
 The formula is an appx of cosine(𝜃 𝑢, 𝑣 )
10
Courtesy: Wikipedia

Details
 Given N data points
 Random projection hashing to produce a M-bit signature fro each data
point
 Choose an arbitrary dimension
 Compare with threshold
 If feature value in the dimension is larger than threshold -> 1
 Otherwise 0
 M arbitrary dimensions, M bit
11
O(MN)

Details
 M-bit signatures
 Near-duplicate signatures ( at least P bit in common )go to the
same bucket
 Let’s say we have T buckets, each bucket Ni points : Ni = N
 𝑖=0
𝑇−1
(𝑁𝑖)2
 Gaussian similarity to compute the distance
between signatures
 Last step running Spectral Clustering on the buckets
12

Detail
 Compute Laplacian Matrix for each similarity matrix Si
 The complexity is 𝑖=0
𝑇−1
(𝑁𝑖)2
 Compute the first K eigenvectors with QR decomposition
 QR : decomposing of a matrix A into a product QR of an orthogonal Q
and an upper triangular matrix R ( O(Ki
3
))
 Reduction of Laplacian to triangular symmetric Ai
 QR becomes O(Ki)
13
O(Ki. 𝑖=0
𝑇−1
(𝑁𝑖)2
)
Adding up :
T DASC = O(M.N) + O(T2) + O(Ki. 𝑖=0
𝑇−1
(𝑁𝑖)2
)

Mapreduce Implementation
 Two MapReduce stages :
 1. applies LSH on input and computes signatures
 Mapper : (index , inputVector) --- > (signature, index)
 Reducer : (signature,index) --- ( signature, listof (index)) [ based on similarity]
 For the approximation P is introduced to restrict the appx
 If P bits out of M are similar : put in the same bucket
 Compare two M-bit signature A and B by bit manipulation :
 A𝑛𝑠 = 𝐴⨁𝐵 𝐴⨁𝐵 − 1
 Ans =0 -> A and B have one bit difference O(1)
 2. Standard MapReduce implementation of spectral clustering in
Mahout library
14

Analysis and Complexity
 Complexity Analysis
 Accuracy Analysis
16

Complexity Analysis
 Computation analysis  Memory analysis
17
Courtesy: reference [1]

Accuracy analysis
 Increasing #hash functions probability of clustering adjacent points
in a bucket decreases
 The incorrectly clustering increases
 Increasing M , collision probability decreases
 Hash function increase : more parallel
 Use M value to partition
 Tuning of M controls the tradeoff
18

Accuracy Analysis
 Randomly selected documents
 Ratio :Correctly classified points to the total number
 DASC better than PSC, close to SC
 NYST on MATLAB [no cluster]
 Clustering didn’t affect accuracy too much
19

Experimental Evaluations
 Platform
 Datasets
 Performance Metrics
 Algorithm Implemented and Compared Agains
 Results for Clustering
 Results for Time and Space Complexities
 Elasticity and Scalability
20

Platform
 Hadoop MapReduce Framework
 Local cluster and Amazon EMR
 Local : 5 machines (1 master/ 4 slaves)
 Intel Core2 Duo E6550
 1GB DRAM
 Hadoop 0.20.2 on Ubunto, Java 1.6.0
 EMR :variable configs : 16,32,64 machines
 1.7GB Memory, 350 GB disk
 Hadoop 0.20.2 on Debian Linux
21

Dataset
 Synthetic
 Real : Wikipedia
22

Job flow on EMR
 Job Flow: collection of steps on a dataset
 First partition dataset into buckets with LSH
 Each bucket stored in a file
 SC applied to each bucket
 Partitions are processed based on mappers available
 Data dependent hashing can be used instead of LSH
 Locality is achieved through the first step LSH
 Highly scalable if nodes are increased or decreased dynamically
allocates fewer or more number of partitions to nodes
23

Algorithms and Performance
metrics
 Computational and space complexity
 Clustering accuracy
 Ratio of correct #clustered points / #total points
 DBI : Davies-Bouldin Index and ASE average squared error
 Fnorm to similarity of original and approximate kernel matrix
 Implemented DASC, PSC and SC, NYST
 DASC modifying Mahout Library M = [(logN)/2−1[ , and P = M-1 to
have O(1) comparison
 SC Mahout
 PSC C++ implementation of SC by PARPACK lib
 NYST implemented by MATLAB
24

Results for Clustering Accuracy
 Similarity between approximate and original Gram matrices
 Using low-level F-norm
 4K to 512K data points
 Not able to use more than 512 K
 Needs square of the input size
 Different number of buckets ! 4 to 4K
 Larger the number of buckets, more partitioning
 Lesser accuracy
 Increase #buckets, decrease F-norm ratio, less similar
 Increase #buckets before Fnorm drops
25

Results for Time and Space
Complexities
 DASC, PSC and SC comparison:
 PSC implemented in C++ with MPI
 DASC implemented in Java with Mapreduce framework
 SC implemented with Mahout library
 Order of magnitude : DASC runs
one order of magnitude faster than PSC
 Mahout SC didn’t scale
to more than 15
26

Conclusion
 Kernel-based Machine Learning
 Approximation for computing kernel matrix
 LSH to reduce the kernel computation
 DASC O(B) reduction in size of problem (B number of buckets)
 B depends on number of bits in signature
 Tested on real and synthetic dataset -> Real : Wikipedia
 90% accuracy
27

Critics
 The approximation factor depends on the size of the problem
 The algorithm is scaled just to kernel-based machine learning
methods which are applicable just to a limited number of datasets
 Algorithms compared are each implemented in different
frameworks, i.e MPI implementation might be really inefficient, thus
the comparison is not really fair!
28

References
 Hefeeda, Mohamed, Fei Gao, and Wael Abd-Almageed.
"Distributed approximate spectral clustering for large-scale
datasets." Proceedings of the 21st international symposium on High-
Performance Parallel and Distributed Computing. ACM, 2012.
 Wikipedia
 Tom M Mitchell slides :
https://www.cs.cmu.edu/~tom/10701_sp11/slides
29

Distributed approximate spectral clustering for large scale datasets

Recommended

More Related Content

What's hot (20)

Viewers also liked (10)

Similar to Distributed approximate spectral clustering for large scale datasets (20)

Recently uploaded (20)

Distributed approximate spectral clustering for large scale datasets