A Scalable Implementation of Deep Learning on Spark
Alexander Ulanov1
Joint work with Xiangrui Meng2, Bert Greevenbosch3
With help from Guoqiang Li4 and Andrey Simanovsky1
1Hewlett Packard Labs 2Databricks 3Huawei & Jules Energy 4Spark community
Outline
– Artificial neural network basics
– Implementation of Multilayer Perceptron (MLP) in Spark
– Optimization & parallelization
– Experiments
– Future work
– What’s new compared to the Spark Summit talk
– Updated and more detailed parallelization heuristic
– Experiments with larger cluster
– Slide design (now Hewlett Packard Enterprise)
Artificial neural network
– Basics
–A statistical model that approximates a function of multiple inputs
–Consists of interconnected “neurons” that exchange messages
–A “neuron” produces an output by applying a transformation function to its inputs
–A network with more than three layers of neurons is called “deep”, an instance of deep learning
– Layer types & learning
–A layer type is defined by a transformation function
–Affine: $y_j = \sum_i w_{ij} x_i + b_j$; Sigmoid: $y_i = \left(1 + e^{-x_i}\right)^{-1}$; Convolution, Softmax, etc.
–Multilayer perceptron (MLP) – a network with several pairs of Affine & Sigmoid layers
–Model parameters – weights that “neurons” use for transformations
–Parameters are iteratively estimated with the backpropagation algorithm
– Multilayer perceptron
–Speech recognition (phoneme classification), computer vision
–Released in Spark 1.5.0
[Diagram: MLP with input $x$, a hidden layer, and output $y$; a forward-pass sketch follows]
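A minimal sketch of this forward pass, assuming Breeze (the linear-algebra library used inside Spark MLlib); the helper is illustrative, not the Spark implementation:

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.sigmoid

// Forward pass of an MLP given as (weights, bias) pairs: each layer applies
// an affine transformation followed by an element-wise sigmoid.
def mlpForward(layers: Seq[(DenseMatrix[Double], DenseVector[Double])],
               x: DenseVector[Double]): DenseVector[Double] =
  layers.foldLeft(x) { case (in, (w, b)) => sigmoid(w.t * in + b) }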
Example of MLP in Spark
–Handwritten digit recognition
–MNIST dataset [LeCun et al. 1998]
–28x28 greyscale images of handwritten digits 0-9
–MLP with 784 inputs, 10 outputs, and two hidden layers of 300 and 100 neurons
Scala
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 300, 100, 10))
  .setBlockSize(128)
val model = mlp.fit(digits)

Python
digits = sqlContext.read.format("libsvm").load("/data/mnist")
mlp = MultilayerPerceptronClassifier(layers=[784, 300, 100, 10], blockSize=128)
model = mlp.fit(digits)

[Diagram: 784 inputs → 1st hidden layer (300 neurons) → 2nd hidden layer (100 neurons) → output layer (10 neurons)]
Pipeline with PCA+MLP in Spark
Scala
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val pca = new PCA()
  .setInputCol("features")
  .setK(20)
  .setOutputCol("features20")
val mlp = new MultilayerPerceptronClassifier()
  .setFeaturesCol("features20")
  .setLayers(Array(20, 50, 10))
  .setBlockSize(128)
val pipeline = new Pipeline()
  .setStages(Array(pca, mlp))
val model = pipeline.fit(digits)

Python
digits = sqlContext.read.format("libsvm").load("/data/mnist8m")
pca = PCA(inputCol="features", k=20, outputCol="features20")
mlp = MultilayerPerceptronClassifier(featuresCol="features20", layers=[20, 50, 10],
                                     blockSize=128)
pipeline = Pipeline(stages=[pca, mlp])
model = pipeline.fit(digits)
MLP implementation in Spark
–Requirements
–Conform to Spark APIs
–Provide extensible interface (deep learning API)
–Efficient and scalable (single node & cluster)
–Why conform to Spark APIs?
–Spark can call any Java, Python or Scala library, not necessarily designed for Spark
–This results in expensive data movement from Spark RDDs to the library
–and prevents use in Spark ML Pipelines
–Extensible interface
–Our implementation processes each layer as a black box with backpropagation in general form
–Allows further introduction of new layers and features
–CNN, (Stacked) Autoencoder, and RBM are currently under development by the community (a sketch of the layer contract follows)
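To make the black-box contract concrete, here is a minimal sketch of such a layer interface, assuming Breeze matrices; the trait and method names are illustrative and simplified, not the exact Spark internals:

import breeze.linalg.DenseMatrix

trait LayerModel {
  // Forward pass: outputs for a batch of inputs (one example per column).
  def eval(input: DenseMatrix[Double]): DenseMatrix[Double]
  // Backward pass: the delta to hand to the previous layer.
  def prevDelta(delta: DenseMatrix[Double], output: DenseMatrix[Double]): DenseMatrix[Double]
  // Gradient of the loss with respect to this layer's weights, flattened.
  def grad(delta: DenseMatrix[Double], input: DenseMatrix[Double]): Array[Double]
}

Backpropagation then only chains eval, prevDelta, and grad across layers, so new layer types plug in without changes to the trainer.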
Efficiency
–Batch processing
–A layer’s affine transformation can be written in vector form: $\mathbf{y} = W^{T}\mathbf{x} + \mathbf{b}$
–$\mathbf{y}$ – output of the layer, vector of size $n$
–$W$ – the $m \times n$ matrix of layer weights, $\mathbf{b}$ – bias, vector of size $n$
–$\mathbf{x}$ – input to the layer, vector of size $m$
–Vector-matrix multiplication is not as efficient as matrix-matrix
–Stack $s$ input vectors into a batch to perform one matrix-matrix multiplication: $Y = W^{T}X + B$
–$X$ is $m \times s$, $Y$ is $n \times s$
–$B$ is $n \times s$; each column contains a copy of $\mathbf{b}$
–We implemented batch processing in matrix form (sketched below)
–Enabled the use of optimized native BLAS libraries
–Memory is reused to limit GC overhead
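A sketch of the batched pass in Breeze, under the notation above; stacking $s$ examples as columns of $X$ turns $s$ matrix-vector products into a single matrix-matrix product:

import breeze.linalg.{DenseMatrix, DenseVector, *}
import breeze.numerics.sigmoid

// Batched affine + sigmoid: one GEMM for the whole batch, then a
// column-wise broadcast of the bias and an element-wise sigmoid.
def forwardBatch(w: DenseMatrix[Double],  // m x n weights
                 b: DenseVector[Double],  // bias of size n
                 x: DenseMatrix[Double]   // m x s batch, one example per column
                ): DenseMatrix[Double] = {
  val y = w.t * x        // n x s: a single matrix-matrix product (GEMM)
  sigmoid(y(::, *) + b)  // add the bias to every column, apply sigmoid
}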
BLAS in Spark
– BLAS – Basic Linear Algebra Subprograms
– Hardware-optimized native implementations in C & Fortran
–CPU: MKL, OpenBLAS, etc.
–GPU: NVBLAS (F-BLAS interface to CUDA)
– Used in Spark through netlib-java (a direct call is sketched below the chart)
– Experiments
– Huge benefit from native BLAS vs pure Java f2jblas
– GPU is faster (2x) only for large matrices
–When compute is larger than the copy to/from the GPU
– More details:
– https://github.com/avulanov/scala-blas
– “linalg: Matrix Computations in Apache Spark”, Reza et al., 2015
[Chart: DGEMM performance, seconds (log scale, 1e-4 to 1e4) vs. matrix sizes from (1x1)*(1x1) up to (10000x10000)*(10000x10000), comparing netlib-NVBLAS, netlib-MKL, netlib-OpenBLAS, and netlib-f2jblas on a single node. CPU: 2x Xeon X5650 @ 2.67GHz, 32GB RAM; GPU: Tesla M2050 3GB, 575MHz, 448 CUDA cores.]
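For reference, a sketch of a direct netlib-java call; at runtime it dispatches to whatever native BLAS is installed (MKL, OpenBLAS, NVBLAS) and falls back to the pure-Java f2jblas otherwise:

import com.github.fommil.netlib.BLAS

// C := alpha*A*B + beta*C in column-major layout; this is the DGEMM
// operation benchmarked in the chart above.
val (m, n, k) = (1000, 1000, 1000)
val a = Array.fill(m * k)(scala.util.Random.nextDouble())
val b = Array.fill(k * n)(scala.util.Random.nextDouble())
val c = new Array[Double](m * n)
BLAS.getInstance.dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)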
Scalability
Parallelization
– Each iteration $k$, each node $i$:
– 1. Gets parameters $w^{k}$ from the master
– 2. Computes a gradient $\nabla_i^{k} F(data_i)$
– 3. Sends the gradient to the master
– 4. The master computes $w^{k+1}$ based on the gradients
– Gradient type
– Batch – process all data on each iteration
– Stochastic – a random point
– Mini-batch – a random batch
– How many workers to use?
– Fewer workers – less computation
– More workers – more communication
[Diagram: the master–executor loop. 1. The master broadcasts $w^{k}$ to executors 1..N; 2. each executor $i$ computes $\nabla_i^{k} F(data_i)$ over its partitions; 3. executors send their gradients to the master; 4. the master aggregates them into $w^{k+1}$ and the loop returns to step 1. A sketch of one iteration follows.]
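A minimal sketch of one batch-gradient iteration on an RDD, assuming Breeze vectors; gradientAt stands in for the model's per-example gradient and is not a Spark API:

import org.apache.spark.rdd.RDD
import breeze.linalg.DenseVector

// One iteration: broadcast w (step 1), sum per-partition gradients with
// treeAggregate (steps 2-3), and take a gradient step on the master (step 4).
def step(data: RDD[(DenseVector[Double], Double)],
         w: DenseVector[Double], lr: Double,
         gradientAt: (DenseVector[Double], (DenseVector[Double], Double)) => DenseVector[Double])
    : DenseVector[Double] = {
  val bcW = data.context.broadcast(w)
  val grad = data.treeAggregate(DenseVector.zeros[Double](w.length))(
    (acc, point) => acc += gradientAt(bcW.value, point), // executor-side sum
    (a, b) => a += b)                                    // all-reduce style merge
  w - grad * lr
}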
Communication and computation trade-off
Parallelization of batch gradient
– There are $d$ data points, $f$ features, and $k$ classes
– Assume we want to train logistic regression; it has $fk$ parameters
– Communication: $n$ workers send/receive $fk$ 64-bit parameters through a network with bandwidth $b$ and software overhead $c$. Using all-reduce:
– $t_{cm} = 2\left(\frac{64fk}{b} + c\right)\log_2 n$
– Computation: each worker has $p$ FLOPS and processes $\frac{d}{n}$ of the data, which needs $2fk$ operations per point:
– $t_{cp} \sim \frac{d}{n} \cdot \frac{2fk}{p}$
– What is the optimal number of workers $N$?
– $\min_n \left(t_{cm} + t_{cp}\right) \Rightarrow N = \max\left(\frac{2dfk \ln 2}{p\left(\frac{128fk}{b} + 2c\right)},\ 1\right)$
– $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(\frac{128w}{b} + 2c\right)},\ 1\right)$, where $l$ is the number of floating-point operations per data point and $w$ the number of parameters
Analysis of the trade-off
Optimal number of workers for batch gradient
– Parallelism in a cluster
– $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(\frac{128w}{b} + 2c\right)},\ 1\right)$
– Analysis
– More FLOPS $p$ means a lower degree of batch-gradient parallelism in a cluster
– More operations, i.e. more features and classes (or a deeper network), means a higher degree
– A small get/send overhead $c$ per message means a higher degree
– Example: MNIST8M handwritten digit recognition dataset
– 8.1M documents, 784 features, 10 classes, logistic regression
– 32 GFLOPS double-precision CPU, 1 Gbit/s network, overhead ~0.1s
– $N = \max\left(\frac{2 \cdot 8.1M \cdot 784 \cdot 10 \cdot 0.69}{32G\left(\frac{128 \cdot 784 \cdot 10}{1G} + 2 \cdot 0.1\right)},\ 1\right) = 12$ (see the sketch below)
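The heuristic is easy to evaluate directly; a sketch with a hypothetical helper, using the deck's units (FLOPS, bit/s, seconds):

// N = max(d*l*ln2 / (p*(128*w/b + 2*c)), 1): d data points, l FLOPs per
// point, w parameters, p worker FLOPS, b network bandwidth, c overhead.
def optimalWorkers(d: Double, l: Double, w: Double,
                   p: Double, b: Double, c: Double): Int =
  math.max(d * l * math.log(2) / (p * (128 * w / b + 2 * c)), 1.0).round.toInt

// MNIST8M logistic regression as above; the deck reports N = 12
// (the exact value depends on rounding of the constants).
optimalWorkers(d = 8.1e6, l = 2 * 784 * 10, w = 784 * 10, p = 32e9, b = 1e9, c = 0.1)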
Artificial neural network case
– Parallelization of batch gradient
– General case
– $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(\frac{128w}{b} + 2c\right)},\ 1\right)$
– Artificial neural network training:
– Forward pass (each layer is a matrix-vector multiplication, $2mn$ operations): $l \mathrel{+}= 2w$
– Back propagation (same): $l \mathrel{+}= 2w$
– Gradient (vector-row matrix multiplication): $l \mathrel{+}= 2w$
– Total: $l = 6w$ (parameter-count sketch below)
– Artificial neural network prediction:
– Forward pass, $l = 2w$
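As a concrete check, a small sketch (hypothetical helper) that counts the parameters $w$ of a fully connected network, from which the training cost is $l = 6w$ per data point:

// Weights plus biases of each fully connected layer.
def paramCount(layers: Seq[Int]): Long =
  layers.sliding(2).map { case Seq(in, out) => in.toLong * out + out }.sum

val w = paramCount(Seq(784, 300, 100, 10)) // 266,610 for the earlier MLP
val lTrain = 6L * w                        // training FLOPs per data point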
Comparison with the best case
– What if we can’t get the optimal number of workers?
– After a quick drop, time decreases slowly and starts increasing at some point
– We can use a smaller cluster that will be only $k$ times slower than the optimal
– Time: $t = 2\left(\frac{64w}{b} + c\right)\log_2 n + \frac{d}{n}\cdot\frac{l}{p} = \alpha \log_2 n + \frac{\beta}{n}$
– Find the number of nodes that is $k$ times slower than the optimal:
– $\alpha \log_2 n + \frac{\beta}{n} = k\,t_N$
– Approximation
– Let’s approximate $\log_2 n$ with $\log_2 N$, substitute $t_N$, and solve the equation for $n$:
– $n = \frac{N}{(k-1)\ln N + k}$
– Also, $k = \frac{\ln N + N/n}{\ln N + 1}$ (how much slower our configuration is than the optimal)
– Example: the number of nodes that runs the logistic regression example 10% slower than the optimal configuration (sketched below)
– Optimal number $N = 12$
– $n = \frac{12}{(1.1-1)\ln 12 + 1.1} \approx 9$
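Both closed forms translate directly into code; a sketch with hypothetical helper names:

// Cluster size that is k times slower than the optimal size nOpt.
def nodesForSlowdown(nOpt: Int, k: Double): Int =
  math.max((nOpt / ((k - 1) * math.log(nOpt) + k)).round.toInt, 1)

// Slowdown factor of running on n nodes instead of the optimal nOpt.
def slowdownForNodes(nOpt: Int, n: Int): Double =
  (math.log(nOpt) + nOpt.toDouble / n) / (math.log(nOpt) + 1)

nodesForSlowdown(12, 1.1) // ≈ 9, as in the example above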
[Chart: Spark MLP vs Caffe MLP, seconds per iteration vs. number of nodes (= workers), 1 to 13. Series: MLP (total), MLP (compute), Caffe CPU, Caffe GPU; the gap between MLP total and compute is the communication & scheduler cost.]
Scalability testing
– Setup
– MNIST handwritten digit recognition, 60K samples
– 6-layer MLP: 784-2500-2000-1500-1000-500-10
– 12M parameters
– CPU: Xeon E31240, 3.3GHz, 105.6 GFLOPS
– GPU: Tesla M2050 3GB, 575MHz
– Caffe (deep learning framework from Berkeley): 1 node
– Spark: 1 master + 5 workers
– Results per iteration
– Single node (both tools in double precision)
– 1.7x slower than Caffe CPU (Scala vs C++)
– Scalability
– 5 nodes give a 4.7x speedup, beat Caffe CPU, close to GPU
– 7 nodes on par with GPU by compute
– $N = \max\left(\frac{60K \cdot 6 \cdot 12M \cdot 0.69}{105.6G\left(\frac{128 \cdot 12M}{950M} + 2 \cdot 0.1\right)},\ 1\right) = 15$
– $k = \frac{\ln 15 + 15/5}{\ln 15 + 1} \approx 1.5$
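These numbers follow from the earlier optimalWorkers and slowdownForNodes sketches (deck values: $N = 15$, $k \approx 1.5$; exact results depend on rounding of the constants):

val n = optimalWorkers(d = 60e3, l = 6 * 12e6, w = 12e6,
                       p = 105.6e9, b = 950e6, c = 0.1) // deck reports N = 15
val k = slowdownForNodes(15, 5)                         // ≈ 1.5 on 5 workers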
Conclusions & future work
– Conclusions
– Scalable multilayer perceptron is available in Spark 1.5.0
– Extensible internal API for Artificial Neural Networks
– Further contributions are welcome!
– Native BLAS (and GPU) speeds up Spark
– Heuristics for parallelization of batch gradient
– Work in progress [SPARK-5575]
– (Stacked)Autoencoder(s)
– Restricted Boltzmann Machines
– Drop-out
– Convolutional neural networks
– Further work
– Adaptive batch LBFGS
– SGD & parameter server
© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Thank you