A Scalable Implementation of Deep Learning on Spark
Alexander Ulanov1
Joint work with Xiangrui Meng2, Bert Greevenbosch3
With help from Guoqiang Li4 and Andrey Simanovsky1
1Hewlett Packard Labs 2Databricks 3Huawei & Jules Energy 4Spark community
Outline
– Artificial neural network basics
– Implementation of Multilayer Perceptron (MLP) in Spark
– Optimization & parallelization
– Experiments
– Future work
– What’s new compared to the Spark Summit talk
– Updated and more detailed parallelization heuristic
– Experiments with larger cluster
– Slide design (now Hewlett Packard Enterprise)
Artificial neural network
– Basics
–A statistical model that approximates a function of multiple inputs
–Consists of interconnected “neurons” that exchange messages
–A “neuron” produces an output by applying a transformation function to its inputs
–A network with more than three layers of neurons is called “deep”, an instance of deep learning
– Layer types & learning
–A layer type is defined by a transformation function
–Affine: $y_j = \sum_i w_{ij} x_i + b_j$; Sigmoid: $y_i = \left(1 + e^{-x_i}\right)^{-1}$; Convolution, Softmax, etc.
–Multilayer perceptron (MLP) – a network with several pairs of Affine & Sigmoid layers
–Model parameters – weights that “neurons” use for transformations
–Parameters are iteratively estimated with the backpropagation algorithm
– Multilayer perceptron
–Speech recognition (phoneme classification), computer vision
–Released in Spark 1.5.0
[Diagram: MLP with input $x$, a hidden layer, and output $y$; a forward-pass sketch follows]
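A minimal sketch of this forward pass, assuming Breeze (the linear-algebra library used inside Spark MLlib); the helper is illustrative, not the Spark implementation:

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.sigmoid

// Forward pass of an MLP given as (weights, bias) pairs: each layer applies
// an affine transformation followed by an element-wise sigmoid.
def mlpForward(layers: Seq[(DenseMatrix[Double], DenseVector[Double])],
               x: DenseVector[Double]): DenseVector[Double] =
  layers.foldLeft(x) { case (in, (w, b)) => sigmoid(w.t * in + b) }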
Example of MLP in Spark
–Handwritten digit recognition
–MNIST dataset [LeCun et al. 1998]
–28x28 greyscale images of handwritten digits 0-9
–MLP with 784 inputs, 10 outputs, and two hidden layers of 300 and 100 neurons
Scala
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 300, 100, 10))
  .setBlockSize(128)
val model = mlp.fit(digits)

Python
digits = sqlContext.read.format("libsvm").load("/data/mnist")
mlp = MultilayerPerceptronClassifier(layers=[784, 300, 100, 10], blockSize=128)
model = mlp.fit(digits)

[Diagram: 784 inputs → 1st hidden layer (300 neurons) → 2nd hidden layer (100 neurons) → output layer (10 neurons)]
Pipeline with PCA+MLP in Spark
Scala
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val pca = new PCA()
  .setInputCol("features")
  .setK(20)
  .setOutputCol("features20")
val mlp = new MultilayerPerceptronClassifier()
  .setFeaturesCol("features20")
  .setLayers(Array(20, 50, 10))
  .setBlockSize(128)
val pipeline = new Pipeline()
  .setStages(Array(pca, mlp))
val model = pipeline.fit(digits)

Python
digits = sqlContext.read.format("libsvm").load("/data/mnist8m")
pca = PCA(inputCol="features", k=20, outputCol="features20")
mlp = MultilayerPerceptronClassifier(featuresCol="features20", layers=[20, 50, 10],
                                     blockSize=128)
pipeline = Pipeline(stages=[pca, mlp])
model = pipeline.fit(digits)
MLP implementation in Spark
–Requirements
–Conform to Spark APIs
–Provide extensible interface (deep learning API)
–Efficient and scalable (single node & cluster)
–Why conform to Spark APIs?
–Spark can call any Java, Python or Scala library, not necessarily designed for Spark
–This results in expensive data movement from Spark RDDs to the library
–and prevents use in Spark ML Pipelines
–Extensible interface
–Our implementation processes each layer as a black box with backpropagation in general form
–Allows further introduction of new layers and features
–CNN, (Stacked) Autoencoder, and RBM are currently under development by the community (a sketch of the layer contract follows)
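To make the black-box contract concrete, here is a minimal sketch of such a layer interface, assuming Breeze matrices; the trait and method names are illustrative and simplified, not the exact Spark internals:

import breeze.linalg.DenseMatrix

trait LayerModel {
  // Forward pass: outputs for a batch of inputs (one example per column).
  def eval(input: DenseMatrix[Double]): DenseMatrix[Double]
  // Backward pass: the delta to hand to the previous layer.
  def prevDelta(delta: DenseMatrix[Double], output: DenseMatrix[Double]): DenseMatrix[Double]
  // Gradient of the loss with respect to this layer's weights, flattened.
  def grad(delta: DenseMatrix[Double], input: DenseMatrix[Double]): Array[Double]
}

Backpropagation then only chains eval, prevDelta, and grad across layers, so new layer types plug in without changes to the trainer.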
Efficiency
–Batch processing
–A layer’s affine transformation can be written in vector form: $\mathbf{y} = W^{T}\mathbf{x} + \mathbf{b}$
–$\mathbf{y}$ – output of the layer, vector of size $n$
–$W$ – the $m \times n$ matrix of layer weights, $\mathbf{b}$ – bias, vector of size $n$
–$\mathbf{x}$ – input to the layer, vector of size $m$
–Vector-matrix multiplication is not as efficient as matrix-matrix
–Stack $s$ input vectors into a batch to perform one matrix-matrix multiplication: $Y = W^{T}X + B$
–$X$ is $m \times s$, $Y$ is $n \times s$
–$B$ is $n \times s$; each column contains a copy of $\mathbf{b}$
–We implemented batch processing in matrix form (sketched below)
–Enabled the use of optimized native BLAS libraries
–Memory is reused to limit GC overhead
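A sketch of the batched pass in Breeze, under the notation above; stacking $s$ examples as columns of $X$ turns $s$ matrix-vector products into a single matrix-matrix product:

import breeze.linalg.{DenseMatrix, DenseVector, *}
import breeze.numerics.sigmoid

// Batched affine + sigmoid: one GEMM for the whole batch, then a
// column-wise broadcast of the bias and an element-wise sigmoid.
def forwardBatch(w: DenseMatrix[Double],  // m x n weights
                 b: DenseVector[Double],  // bias of size n
                 x: DenseMatrix[Double]   // m x s batch, one example per column
                ): DenseMatrix[Double] = {
  val y = w.t * x        // n x s: a single matrix-matrix product (GEMM)
  sigmoid(y(::, *) + b)  // add the bias to every column, apply sigmoid
}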
BLAS in Spark
– BLAS – Basic Linear Algebra Subprograms
– Hardware-optimized native implementations in C & Fortran
–CPU: MKL, OpenBLAS, etc.
–GPU: NVBLAS (F-BLAS interface to CUDA)
– Used in Spark through netlib-java (a direct call is sketched below the chart)
– Experiments
– Huge benefit from native BLAS vs pure Java f2jblas
– GPU is faster (2x) only for large matrices
–When compute is larger than the copy to/from the GPU
– More details:
– https://github.com/avulanov/scala-blas
– “linalg: Matrix Computations in Apache Spark”, Reza et al., 2015
[Chart: DGEMM performance, seconds (log scale, 1e-4 to 1e4) vs. matrix sizes from (1x1)*(1x1) up to (10000x10000)*(10000x10000), comparing netlib-NVBLAS, netlib-MKL, netlib-OpenBLAS, and netlib-f2jblas on a single node. CPU: 2x Xeon X5650 @ 2.67GHz, 32GB RAM; GPU: Tesla M2050 3GB, 575MHz, 448 CUDA cores.]
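For reference, a sketch of a direct netlib-java call; at runtime it dispatches to whatever native BLAS is installed (MKL, OpenBLAS, NVBLAS) and falls back to the pure-Java f2jblas otherwise:

import com.github.fommil.netlib.BLAS

// C := alpha*A*B + beta*C in column-major layout; this is the DGEMM
// operation benchmarked in the chart above.
val (m, n, k) = (1000, 1000, 1000)
val a = Array.fill(m * k)(scala.util.Random.nextDouble())
val b = Array.fill(k * n)(scala.util.Random.nextDouble())
val c = new Array[Double](m * n)
BLAS.getInstance.dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)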
Scalability
Parallelization
– Each iteration $k$, each node $i$:
– 1. Gets parameters $w^{k}$ from the master
– 2. Computes a gradient $\nabla_i^{k} F(data_i)$
– 3. Sends the gradient to the master
– 4. The master computes $w^{k+1}$ based on the gradients
– Gradient type
– Batch – process all data on each iteration
– Stochastic – a random point
– Mini-batch – a random batch
– How many workers to use?
– Fewer workers – less computation
– More workers – more communication
[Diagram: the master–executor loop. 1. The master broadcasts $w^{k}$ to executors 1..N; 2. each executor $i$ computes $\nabla_i^{k} F(data_i)$ over its partitions; 3. executors send their gradients to the master; 4. the master aggregates them into $w^{k+1}$ and the loop returns to step 1. A sketch of one iteration follows.]
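A minimal sketch of one batch-gradient iteration on an RDD, assuming Breeze vectors; gradientAt stands in for the model's per-example gradient and is not a Spark API:

import org.apache.spark.rdd.RDD
import breeze.linalg.DenseVector

// One iteration: broadcast w (step 1), sum per-partition gradients with
// treeAggregate (steps 2-3), and take a gradient step on the master (step 4).
def step(data: RDD[(DenseVector[Double], Double)],
         w: DenseVector[Double], lr: Double,
         gradientAt: (DenseVector[Double], (DenseVector[Double], Double)) => DenseVector[Double])
    : DenseVector[Double] = {
  val bcW = data.context.broadcast(w)
  val grad = data.treeAggregate(DenseVector.zeros[Double](w.length))(
    (acc, point) => acc += gradientAt(bcW.value, point), // executor-side sum
    (a, b) => a += b)                                    // all-reduce style merge
  w - grad * lr
}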
Communication and computation trade-off
Parallelization of batch gradient
– There are $d$ data points, $f$ features, and $k$ classes
– Assume we want to train logistic regression; it has $fk$ parameters
– Communication: $n$ workers send/receive $fk$ 64-bit parameters through a network with bandwidth $b$ and software overhead $c$. Using all-reduce:
– $t_{cm} = 2\left(\frac{64fk}{b} + c\right)\log_2 n$
– Computation: each worker has $p$ FLOPS and processes $\frac{d}{n}$ of the data, which needs $2fk$ operations per point:
– $t_{cp} \sim \frac{d}{n} \cdot \frac{2fk}{p}$
– What is the optimal number of workers $N$?
– $\min_n \left(t_{cm} + t_{cp}\right) \Rightarrow N = \max\left(\frac{2dfk \ln 2}{p\left(\frac{128fk}{b} + 2c\right)},\ 1\right)$
– $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(\frac{128w}{b} + 2c\right)},\ 1\right)$, where $l$ is the number of floating-point operations per data point and $w$ the number of parameters
Analysis of the trade-off
Optimal number of workers for batch gradient
– Parallelism in a cluster
– $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(\frac{128w}{b} + 2c\right)},\ 1\right)$
– Analysis
– More FLOPS $p$ means a lower degree of batch-gradient parallelism in a cluster
– More operations, i.e. more features and classes (or a deeper network), means a higher degree
– A small get/send overhead $c$ per message means a higher degree
– Example: MNIST8M handwritten digit recognition dataset
– 8.1M documents, 784 features, 10 classes, logistic regression
– 32 GFLOPS double-precision CPU, 1 Gbit/s network, overhead ~0.1s
– $N = \max\left(\frac{2 \cdot 8.1M \cdot 784 \cdot 10 \cdot 0.69}{32G\left(\frac{128 \cdot 784 \cdot 10}{1G} + 2 \cdot 0.1\right)},\ 1\right) = 12$ (see the sketch below)
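The heuristic is easy to evaluate directly; a sketch with a hypothetical helper, using the deck's units (FLOPS, bit/s, seconds):

// N = max(d*l*ln2 / (p*(128*w/b + 2*c)), 1): d data points, l FLOPs per
// point, w parameters, p worker FLOPS, b network bandwidth, c overhead.
def optimalWorkers(d: Double, l: Double, w: Double,
                   p: Double, b: Double, c: Double): Int =
  math.max(d * l * math.log(2) / (p * (128 * w / b + 2 * c)), 1.0).round.toInt

// MNIST8M logistic regression as above; the deck reports N = 12
// (the exact value depends on rounding of the constants).
optimalWorkers(d = 8.1e6, l = 2 * 784 * 10, w = 784 * 10, p = 32e9, b = 1e9, c = 0.1)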
Artificial neural network case
– Parallelization of batch gradient
– General case
– $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(\frac{128w}{b} + 2c\right)},\ 1\right)$
– Artificial neural network training:
– Forward pass (each layer is a matrix-vector multiplication, $2mn$ operations): $l \mathrel{+}= 2w$
– Back propagation (same): $l \mathrel{+}= 2w$
– Gradient (vector-row matrix multiplication): $l \mathrel{+}= 2w$
– Total: $l = 6w$ (parameter-count sketch below)
– Artificial neural network prediction:
– Forward pass, $l = 2w$
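As a concrete check, a small sketch (hypothetical helper) that counts the parameters $w$ of a fully connected network, from which the training cost is $l = 6w$ per data point:

// Weights plus biases of each fully connected layer.
def paramCount(layers: Seq[Int]): Long =
  layers.sliding(2).map { case Seq(in, out) => in.toLong * out + out }.sum

val w = paramCount(Seq(784, 300, 100, 10)) // 266,610 for the earlier MLP
val lTrain = 6L * w                        // training FLOPs per data point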
Comparison with the best case
– What if we can’t get the optimal number of workers?
– After a quick drop, time decreases slowly and starts increasing at some point
– We can use a smaller cluster that will be only $k$ times slower than the optimal
– Time: $t = 2\left(\frac{64w}{b} + c\right)\log_2 n + \frac{d}{n}\cdot\frac{l}{p} = \alpha \log_2 n + \frac{\beta}{n}$
– Find the number of nodes that is $k$ times slower than the optimal:
– $\alpha \log_2 n + \frac{\beta}{n} = k\,t_N$
– Approximation
– Let’s approximate $\log_2 n$ with $\log_2 N$, substitute $t_N$, and solve the equation for $n$:
– $n = \frac{N}{(k-1)\ln N + k}$
– Also, $k = \frac{\ln N + N/n}{\ln N + 1}$ (how much slower our configuration is than the optimal)
– Example: the number of nodes that runs the logistic regression example 10% slower than the optimal configuration (sketched below)
– Optimal number $N = 12$
– $n = \frac{12}{(1.1-1)\ln 12 + 1.1} \approx 9$
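Both closed forms translate directly into code; a sketch with hypothetical helper names:

// Cluster size that is k times slower than the optimal size nOpt.
def nodesForSlowdown(nOpt: Int, k: Double): Int =
  math.max((nOpt / ((k - 1) * math.log(nOpt) + k)).round.toInt, 1)

// Slowdown factor of running on n nodes instead of the optimal nOpt.
def slowdownForNodes(nOpt: Int, n: Int): Double =
  (math.log(nOpt) + nOpt.toDouble / n) / (math.log(nOpt) + 1)

nodesForSlowdown(12, 1.1) // ≈ 9, as in the example above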
[Chart: Spark MLP vs Caffe MLP, seconds per iteration vs. number of nodes (= workers), 1 to 13. Series: MLP (total), MLP (compute), Caffe CPU, Caffe GPU; the gap between MLP total and compute is the communication & scheduler cost.]
Scalability testing
– Setup
– MNIST handwritten digit recognition, 60K samples
– 6-layer MLP: 784-2500-2000-1500-1000-500-10
– 12M parameters
– CPU: Xeon E31240, 3.3GHz, 105.6 GFLOPS
– GPU: Tesla M2050 3GB, 575MHz
– Caffe (deep learning framework from Berkeley): 1 node
– Spark: 1 master + 5 workers
– Results per iteration
– Single node (both tools in double precision)
– 1.7x slower than Caffe CPU (Scala vs C++)
– Scalability
– 5 nodes give a 4.7x speedup, beat Caffe CPU, close to GPU
– 7 nodes on par with GPU by compute
– $N = \max\left(\frac{60K \cdot 6 \cdot 12M \cdot 0.69}{105.6G\left(\frac{128 \cdot 12M}{950M} + 2 \cdot 0.1\right)},\ 1\right) = 15$
– $k = \frac{\ln 15 + 15/5}{\ln 15 + 1} \approx 1.5$
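These numbers follow from the earlier optimalWorkers and slowdownForNodes sketches (deck values: $N = 15$, $k \approx 1.5$; exact results depend on rounding of the constants):

val n = optimalWorkers(d = 60e3, l = 6 * 12e6, w = 12e6,
                       p = 105.6e9, b = 950e6, c = 0.1) // deck reports N = 15
val k = slowdownForNodes(15, 5)                         // ≈ 1.5 on 5 workers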
Conclusions & future work
– Conclusions
– Scalable multilayer perceptron is available in Spark 1.5.0
– Extensible internal API for Artificial Neural Networks
– Further contributions are welcome!
– Native BLAS (and GPU) speeds up Spark
– Heuristics for parallelization of batch gradient
– Work in progress [SPARK-5575]
– (Stacked)Autoencoder(s)
– Restricted Boltzmann Machines
– Drop-out
– Convolutional neural networks
– Further work
– Adaptive batch LBFGS
– SGD & parameter server
© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Thank you