Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it the Right Way?

Deep Learning
with DL4J on
Apache Spark:
Yeah it's Cool, but
are You Doing it the
Right Way?

Hello!
I am Guglielmo Iozzia
Associate Director – Business Tech Analysis at
Previuosly at
2

MSD in Ireland + 50 years
Approx. 2,000 employees
Five sites: Ballydine, Brinny, Carlow and
Dublin
$2.5 billion investment to date
Approx 50% MSD’s top 20 products
manufactured here
Export to + 60 countries
€6.1 billion turnover in 2017
2017 + 300 jobs & €280m investment
MSD Biotech, Dublin, coming in 2021

Deep Learning
It is a subset of machine learning where artificial neural networks, algorithms
inspired by the human brain, learn from large amounts of data.
4

Some Practical Applications of Deep Learning
× Computer vision
× Text generation
× NLP and NLU
× Autonomous cars
× Robotics
× Gaming
× Quantitative finance
5

DL4J
It is an Open Source, distributed, Deep Learning
framework written for JVM languages.
6

It is integrated with Hadoop and Apache Spark and
can be used on distributed GPUs and CPUs.
7

DL4J Modules
× DataVec
× Arbiter
× NN
× Datasets
× RL4J
× DL4J-Spark
× ND4J
8

DL4J Code Example
9
Training and Evaluation
Network
Configuration

ND4J
It is an Open Source linear algebra and matrix manipulation library which supports n-dimensional arrays and it
is integrated with Apache Hadoop and Spark.
10

Apache Spark
It is a unified analytics engine for large-scale data
processing.
11

Speed
Apache Spark achieves high
performance for both batch
and streaming data, using a
state-of-the-art DAG
scheduler, a query
optimizer, and a physical
execution engine.
Apache Spark
Ease of Use
Write applications quickly in
Java, Scala, Python, R and
SQL.
12

Generality
Spark provides a stack of
libraries that can be
combined seamlessly in the
same application.
Apache Spark
Runs Everywhere
Spark runs on Hadoop,
Apache Mesos, Kubernetes,
standalone or in the cloud. It
can access diverse data
sources.
13

Why Distributed MNN Training with
DL4J and Apache Spark?
Why this is a powerful combination?
14

DL4J + Apache Spark
× DL4J provides high level API to design, configure train and
evaluate MNNs.
× Spark performances are excellent in particular for
ETL/streaming, but in terms of computation, in a MNN
training context, some data transformation/aggregation
need to be done using a low-level language.
× DL4J uses ND4J, which is a C++ library that provides high
level Scala API to developers.
15

Model Parallelization
DL4J + Apache Spark
Data Parallelization
16

So: What could possibly go wrong?
17

Memory Management
And now, for something (a
little bit) different.
18

Memory Utilization at Training Time
19

Memory Management in DL4J
Memory allocations can be managed using two different
approaches:
×JVM GC and WeakReference tracking
×MemoryWorkspaces
The idea behind both is the same:
once a NDArray is no longer required, the off-heap
memory associated with it should be released so that it
can be reused. 20

The difference between the two approaches is:
×JVM GC: when a INDArray is collected by the garbage
collector, its off-heap memory is deallocated, with the
assumption that it is not used elsewhere.
×MemoryWorkspaces: when a INDArray leaves the
workspace scope, its off-heap memory may be reused,
without deallocation and reallocation.
21

Please remember that, when a training process uses
workspaces, in order to get the most
from this approach, periodic GC calls need to be
disabled:
Nd4j.getMemoryManager.togglePeriodicGc(false)
or their frequency needs to be reduced:
val gcInterval = 10000 // In milliseconds
Nd4j.getMemoryManager.setAutoGcWindow(gcInterval)
22

Spark & the DL4J
Web UI
A love/hate relationship.
23

Root Cause and Potential Solutions
Dependencies conflict between the DL4J-UI library and
Apache Spark when running in the same JVM.
Two alternatives are available:
×Collect and save the relevant training stats at runtime,
and then visualize them offline later.
×Execute the UI and use its remote functionality into
separate JVMs (servers). Metrics are uploaded from the
Spark master to the UI server.
25

Serialization & ND4J
Kryo me a river.
26

Data Serialization Options in Spark
Data Serialization is the process of converting the in-
memory objects to another format that can be used
to store or send them over the network.
Two options available in Spark:
×Java (default)
×Kryo
27

OOOps!!!
Kryo doesn’t work well with
off-heap data structures.
28

How to Use Kryo Serialization with ND4J?
×Add the ND4J-Kryo dependency to the project
×Configure the Spark application to use the ND4J Kryo
Registrator:
val sparkConf = new SparkConf
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrator", "org.nd4j.Nd4jRegistrator")
29

Data Locality
How data locality affects
performance?
30

Data Locality in Spark
Data locality in Spark means doing computation on the
node where data resides.
In order to optimize processing tasks, Spark tries to place the
execution code as close as possible to the processed data.
×It tries first to move serialized code to the data.
×Sometimes this isn’t possible and the data must be moved to
the executor.
31

How Spark handles data locality?
×It prefers to schedule all tasks at the best locality level.
×When there is no unprocessed data on any idle executor, it
switches to lower locality levels.
− It can wait until a busy CPU frees up to start a task on
data on the same server.
− It can immediately start a new task in a farther place that
requires moving data there.
32

Spark typically waits a bit in the hopes that a busy CPU
frees up. Once that timeout (default is 3 sec) expires, it
starts moving the data to the free CPU. But:
×Training neural networks with DL4J is computationally
expensive.
×So the Spark default behavior isn’t an ideal fit for maximizing
cluster utilization.
33

Data Locality in Spark and DL4J
During training on Spark, DL4J ensures that there is
exactly one task per executor: so it is always better to
immediately transfer data to a free executor, rather than
waiting for another one to become free. Computation
time is more important than any network transfer
time.
34

Data Locality in Spark and DL4J
Spark provides the spark.locality.wait configuration
property: it is the timeout (in seconds) to wait before
moving data to a free CPU.
So, when submitting the configuration for a DL4J training
app, we have to set the value of the spark.locality.wait
property to 0.
35

Handling Java Objects with Large
Off-heap Components
The Off-heap, again.
36

Spark and Large Off-heap Objects
Spark has problems handling Java objects with large off-
heap components, in particular in caching or persisting
them.
When working with DL4J, this is a frequent case, as
DataSet and NDArray objects are involved.
37

Spark drops part of a RDD based on the estimated size of
that block. It estimates the size of a block depending on
the selected persistence level.
In case of MEMORY_ONLY or
MEMORY_AND_DISK, the estimate is done by
walking the Java object graph.
This process doesn't take into account the off-heap
memory used by DL4J and ND4J, so Spark under-
estimates the true size of objects like DataSets or
NDArrays. 38

When deciding bewteen keeping or dropping blocks,
Spark considers only the amount of heap memory used.
DataSet and NDArray objects have a very small on-heap
size, then Spark will keep too many of them, causing out
of memory issues as off-heap memory becomes
exhausted.
39

It is then good practice using MEMORY_ONLY_SER or
MEMORY_AND_DISK_SER when persisting a
RDD<DataSet> or a RDD<INDArray>.
This way Spark stores blocks on the JVM heap in
serialized form. Because there is no off-heap memory for
the serialized objects, it can accurately estimate their size,
in so avoiding out of memory issues.
40

Image Pipeline Data
Preparation
One Batch, Two Batch
(Penny and Dime)
41

Convolutional Neural Network (CNN)
42
By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

Single Machine
You can use DataVec’s
ImageRecordReader.
Image Pipeline Data Preparation
Spark Cluster
Image preprocessing.
43

Image Pipeline Data Preparation
The Spark strategy assume the images are in subdirectories
based on their class labels. Example:
imageRootDir/car/img0.jpg
imageRootDir/car/img1.jpg
...
imageRootDir/truck/img0.jpg
imageRootDir/truck/img1.jpg
...
imageRootDir/motorbike/img0.jpg
imageRootDir/motorbike/img1.jpg
... 44

Image Pipeline Data Preparation (Spark)
The approach is to preprocess the images into batches of files
(ND4J’s FileBatch objects).
The motivation: the original image files typically use efficient
compression (JPEG, PNG, other) which is much more space
and network efficient than a bitmap representation. However, on
a cluster we want to minimize disk reads due to latency issues
with remote storage – one single file read/transfer is faster than
multiple remote file reads.
45

Step 1 (option 1): Preprocess images locally.
val sourceDirectory = "/home/guglielmo/training_images"
val destinationDirectory = "/home/guglielmo/preprocessed_images"
val batchSize = 32
SparkDataUtils.createFileBatchesLocal(sourceDirectory,
NativeImageLoader.ALLOWED_FORMATS, true, destinationDirectory,
batchSize)
After the preprocessing completes, the destination directory can
be copied to the cluster.
46

Step 1 (option 2): Preprocess images using Spark.
val sourceDirectory = “hdfs:///data/training_images”
val destinationDirectory = “hdfs:///data/preprocessed_images”
val batchSize = 32
SparkDataUtils.createFileBatchesSpark(sourceDirectory, destinationDirectory,
batchSize, sparkContext)
47

Step 2: Create a data loader...
val imageHeightWidth = 64
val imageChannels = 3
val labelMaker:PathLabelGenerator = new ParentPathLabelGenerator
val rr = new ImageRecordReader(imageHeightWidth, imageHeightWidth,
imageChannels, labelMaker)
rr.setLabels(new TinyImageNetDataSetIterator(1).getLabels)
val numClasses = TinyImageNetFetcher.NUM_LABELS
val loader = new RecordReaderFileBatchLoader(rr, minibatch, 1, numClasses)
loader.setPreProcessor(new ImagePreProcessingScaler)
48

Step 2: ...and finally train the model
val trainDataPath = "hdfs:///data/preprocessed_images"
val pathsTrain:JavaRDD<String> = SparkUtils.listPaths(sparkContext,
trainDataPath)
for (i <- 0 until numEpochs) {
sparkNet.fitPaths(pathsTrain, loader)
}
49

All the Details on DL4J and Spark in my
Book
http://tinyurl.com/y9jkvtuy
50

Thanks!
Any questions?
You can find me at
@guglielmoiozzia
https://ie.linkedin.com/in/giozzia
googlielmo.blogspot.com
51

Credits
Special thanks to all the people who made and
released these awesome resources for free:
×Presentation template by SlidesCarnival
52

Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it the Right Way?

Recommended

More Related Content

What's hot (20)

Similar to Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it the Right Way? (20)

More from DataWorks Summit (20)

Recently uploaded (20)

Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it the Right Way?

Editor's Notes