SlideShare a Scribd company logo
Deep Learning
with DL4J on
Apache Spark:
Yeah it's Cool, but
are You Doing it the
Right Way?
Hello!
I am Guglielmo Iozzia
Associate Director – Business Tech Analysis at
Previuosly at
2
MSD in Ireland + 50 years
Approx. 2,000 employees
Five sites: Ballydine, Brinny, Carlow and
Dublin
$2.5 billion investment to date
Approx 50% MSD’s top 20 products
manufactured here
Export to + 60 countries
€6.1 billion turnover in 2017
2017 + 300 jobs & €280m investment
MSD Biotech, Dublin, coming in 2021
Deep Learning
It is a subset of machine learning where artificial neural networks, algorithms
inspired by the human brain, learn from large amounts of data.
4
Some Practical Applications of Deep Learning
× Computer vision
× Text generation
× NLP and NLU
× Autonomous cars
× Robotics
× Gaming
× Quantitative finance
5
DL4J
It is an Open Source, distributed, Deep Learning
framework written for JVM languages.
6
It is integrated with Hadoop and Apache Spark and
can be used on distributed GPUs and CPUs.
7
DL4J Modules
× DataVec
× Arbiter
× NN
× Datasets
× RL4J
× DL4J-Spark
× ND4J
8
DL4J Code Example
9
Training and Evaluation
Network
Configuration
ND4J
It is an Open Source linear algebra and matrix manipulation library which supports n-dimensional arrays and it
is integrated with Apache Hadoop and Spark.
10
Apache Spark
It is a unified analytics engine for large-scale data
processing.
11
Speed
Apache Spark achieves high
performance for both batch
and streaming data, using a
state-of-the-art DAG
scheduler, a query
optimizer, and a physical
execution engine.
Apache Spark
Ease of Use
Write applications quickly in
Java, Scala, Python, R and
SQL.
12
Generality
Spark provides a stack of
libraries that can be
combined seamlessly in the
same application.
Apache Spark
Runs Everywhere
Spark runs on Hadoop,
Apache Mesos, Kubernetes,
standalone or in the cloud. It
can access diverse data
sources.
13
Why Distributed MNN Training with
DL4J and Apache Spark?
Why this is a powerful combination?
14
DL4J + Apache Spark
× DL4J provides high level API to design, configure train and
evaluate MNNs.
× Spark performances are excellent in particular for
ETL/streaming, but in terms of computation, in a MNN
training context, some data transformation/aggregation
need to be done using a low-level language.
× DL4J uses ND4J, which is a C++ library that provides high
level Scala API to developers.
15
Model Parallelization
DL4J + Apache Spark
Data Parallelization
16
So: What could possibly go wrong?
17
Memory Management
And now, for something (a
little bit) different.
18
Memory Utilization at Training Time
19
Memory Management in DL4J
Memory allocations can be managed using two different
approaches:
×JVM GC and WeakReference tracking
×MemoryWorkspaces
The idea behind both is the same:
once a NDArray is no longer required, the off-heap
memory associated with it should be released so that it
can be reused. 20
Memory Management in DL4J
The difference between the two approaches is:
×JVM GC: when a INDArray is collected by the garbage
collector, its off-heap memory is deallocated, with the
assumption that it is not used elsewhere.
×MemoryWorkspaces: when a INDArray leaves the
workspace scope, its off-heap memory may be reused,
without deallocation and reallocation.
21
Memory Management in DL4J
Please remember that, when a training process uses
workspaces, in order to get the most
from this approach, periodic GC calls need to be
disabled:
Nd4j.getMemoryManager.togglePeriodicGc(false)
or their frequency needs to be reduced:
val gcInterval = 10000 // In milliseconds
Nd4j.getMemoryManager.setAutoGcWindow(gcInterval)
22
Spark & the DL4J
Web UI
A love/hate relationship.
23
The DL4J Training UI
24
Root Cause and Potential Solutions
Dependencies conflict between the DL4J-UI library and
Apache Spark when running in the same JVM.
Two alternatives are available:
×Collect and save the relevant training stats at runtime,
and then visualize them offline later.
×Execute the UI and use its remote functionality into
separate JVMs (servers). Metrics are uploaded from the
Spark master to the UI server.
25
Serialization & ND4J
Kryo me a river.
26
Data Serialization Options in Spark
Data Serialization is the process of converting the in-
memory objects to another format that can be used
to store or send them over the network.
Two options available in Spark:
×Java (default)
×Kryo
27
OOOps!!!
Kryo doesn’t work well with
off-heap data structures.
28
How to Use Kryo Serialization with ND4J?
×Add the ND4J-Kryo dependency to the project
×Configure the Spark application to use the ND4J Kryo
Registrator:
val sparkConf = new SparkConf
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrator", "org.nd4j.Nd4jRegistrator")
29
Data Locality
How data locality affects
performance?
30
Data Locality in Spark
Data locality in Spark means doing computation on the
node where data resides.
In order to optimize processing tasks, Spark tries to place the
execution code as close as possible to the processed data.
×It tries first to move serialized code to the data.
×Sometimes this isn’t possible and the data must be moved to
the executor.
31
Data Locality in Spark
How Spark handles data locality?
×It prefers to schedule all tasks at the best locality level.
×When there is no unprocessed data on any idle executor, it
switches to lower locality levels.
− It can wait until a busy CPU frees up to start a task on
data on the same server.
− It can immediately start a new task in a farther place that
requires moving data there.
32
Data Locality in Spark
Spark typically waits a bit in the hopes that a busy CPU
frees up. Once that timeout (default is 3 sec) expires, it
starts moving the data to the free CPU. But:
×Training neural networks with DL4J is computationally
expensive.
×So the Spark default behavior isn’t an ideal fit for maximizing
cluster utilization.
33
Data Locality in Spark and DL4J
During training on Spark, DL4J ensures that there is
exactly one task per executor: so it is always better to
immediately transfer data to a free executor, rather than
waiting for another one to become free. Computation
time is more important than any network transfer
time.
34
Data Locality in Spark and DL4J
Spark provides the spark.locality.wait configuration
property: it is the timeout (in seconds) to wait before
moving data to a free CPU.
So, when submitting the configuration for a DL4J training
app, we have to set the value of the spark.locality.wait
property to 0.
35
Handling Java Objects with Large
Off-heap Components
The Off-heap, again.
36
Spark and Large Off-heap Objects
Spark has problems handling Java objects with large off-
heap components, in particular in caching or persisting
them.
When working with DL4J, this is a frequent case, as
DataSet and NDArray objects are involved.
37
Spark and Large Off-heap Objects
Spark drops part of a RDD based on the estimated size of
that block. It estimates the size of a block depending on
the selected persistence level.
In case of MEMORY_ONLY or
MEMORY_AND_DISK, the estimate is done by
walking the Java object graph.
This process doesn't take into account the off-heap
memory used by DL4J and ND4J, so Spark under-
estimates the true size of objects like DataSets or
NDArrays. 38
Spark and Large Off-heap Objects
When deciding bewteen keeping or dropping blocks,
Spark considers only the amount of heap memory used.
DataSet and NDArray objects have a very small on-heap
size, then Spark will keep too many of them, causing out
of memory issues as off-heap memory becomes
exhausted.
39
Spark and Large Off-heap Objects
It is then good practice using MEMORY_ONLY_SER or
MEMORY_AND_DISK_SER when persisting a
RDD<DataSet> or a RDD<INDArray>.
This way Spark stores blocks on the JVM heap in
serialized form. Because there is no off-heap memory for
the serialized objects, it can accurately estimate their size,
in so avoiding out of memory issues.
40
Image Pipeline Data
Preparation
One Batch, Two Batch
(Penny and Dime)
41
Convolutional Neural Network (CNN)
42
By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374
Single Machine
You can use DataVec’s
ImageRecordReader.
Image Pipeline Data Preparation
Spark Cluster
Image preprocessing.
43
Image Pipeline Data Preparation
The Spark strategy assume the images are in subdirectories
based on their class labels. Example:
imageRootDir/car/img0.jpg
imageRootDir/car/img1.jpg
...
imageRootDir/truck/img0.jpg
imageRootDir/truck/img1.jpg
...
imageRootDir/motorbike/img0.jpg
imageRootDir/motorbike/img1.jpg
... 44
Image Pipeline Data Preparation (Spark)
The approach is to preprocess the images into batches of files
(ND4J’s FileBatch objects).
The motivation: the original image files typically use efficient
compression (JPEG, PNG, other) which is much more space
and network efficient than a bitmap representation. However, on
a cluster we want to minimize disk reads due to latency issues
with remote storage – one single file read/transfer is faster than
multiple remote file reads.
45
Image Pipeline Data Preparation (Spark)
Step 1 (option 1): Preprocess images locally.
val sourceDirectory = "/home/guglielmo/training_images"
val destinationDirectory = "/home/guglielmo/preprocessed_images"
val batchSize = 32
SparkDataUtils.createFileBatchesLocal(sourceDirectory,
NativeImageLoader.ALLOWED_FORMATS, true, destinationDirectory,
batchSize)
After the preprocessing completes, the destination directory can
be copied to the cluster.
46
Image Pipeline Data Preparation (Spark)
Step 1 (option 2): Preprocess images using Spark.
val sourceDirectory = “hdfs:///data/training_images”
val destinationDirectory = “hdfs:///data/preprocessed_images”
val batchSize = 32
SparkDataUtils.createFileBatchesSpark(sourceDirectory, destinationDirectory,
batchSize, sparkContext)
47
Image Pipeline Data Preparation (Spark)
Step 2: Create a data loader...
val imageHeightWidth = 64
val imageChannels = 3
val labelMaker:PathLabelGenerator = new ParentPathLabelGenerator
val rr = new ImageRecordReader(imageHeightWidth, imageHeightWidth,
imageChannels, labelMaker)
rr.setLabels(new TinyImageNetDataSetIterator(1).getLabels)
val numClasses = TinyImageNetFetcher.NUM_LABELS
val loader = new RecordReaderFileBatchLoader(rr, minibatch, 1, numClasses)
loader.setPreProcessor(new ImagePreProcessingScaler)
48
Image Pipeline Data Preparation (Spark)
Step 2: ...and finally train the model
val trainDataPath = "hdfs:///data/preprocessed_images"
val pathsTrain:JavaRDD<String> = SparkUtils.listPaths(sparkContext,
trainDataPath)
for (i <- 0 until numEpochs) {
sparkNet.fitPaths(pathsTrain, loader)
}
49
All the Details on DL4J and Spark in my
Book
http://tinyurl.com/y9jkvtuy
50
Thanks!
Any questions?
You can find me at
@guglielmoiozzia
https://ie.linkedin.com/in/giozzia
googlielmo.blogspot.com
51
Credits
Special thanks to all the people who made and
released these awesome resources for free:
×Presentation template by SlidesCarnival
52
Ad

More Related Content

What's hot (20)

The Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingThe Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral Processing
DataWorks Summit
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
Databricks
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud?
DataWorks Summit
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
Alluxio, Inc.
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Databricks
 
Data Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityData Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data Security
DataWorks Summit
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
Jan Pieter Posthuma
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storage
DataWorks Summit
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
DataWorks Summit
 
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
Cloudera, Inc.
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
DataWorks Summit/Hadoop Summit
 
Scheduling Policies in YARN
Scheduling Policies in YARNScheduling Policies in YARN
Scheduling Policies in YARN
DataWorks Summit/Hadoop Summit
 
To The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid AnalyticsTo The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid Analytics
DataWorks Summit/Hadoop Summit
 
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
DataStax
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Masayuki Matsushita
 
dplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Datadplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Data
Cloudera, Inc.
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache Ranger
DataWorks Summit
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recovery
DataWorks Summit
 
Hadoop Platform at Yahoo
Hadoop Platform at YahooHadoop Platform at Yahoo
Hadoop Platform at Yahoo
DataWorks Summit/Hadoop Summit
 
The Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingThe Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral Processing
DataWorks Summit
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
Databricks
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud?
DataWorks Summit
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
Alluxio, Inc.
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Databricks
 
Data Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityData Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data Security
DataWorks Summit
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
Jan Pieter Posthuma
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storage
DataWorks Summit
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
DataWorks Summit
 
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
Cloudera, Inc.
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
DataWorks Summit/Hadoop Summit
 
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
DataStax
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Masayuki Matsushita
 
dplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Datadplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Data
Cloudera, Inc.
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache Ranger
DataWorks Summit
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recovery
DataWorks Summit
 

Similar to Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it the Right Way? (20)

Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Databricks
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
MaheshPandit16
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Databricks
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
Edureka!
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
Sahan Bulathwela
 
Deep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and sparkDeep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and spark
François Garillot
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Databricks
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsfPyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
Advanced deeplearning4j features
Advanced deeplearning4j featuresAdvanced deeplearning4j features
Advanced deeplearning4j features
Adam Gibson
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
DESMOND YUEN
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Databricks
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
Jyotasana Bharti
 
Spark 101
Spark 101Spark 101
Spark 101
Shahaf Azriely {TopLinked} ☁
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
Xuan-Chao Huang
 
Engagement_DataBricks_Amit_Kumar_Part_01 (1).pptx
Engagement_DataBricks_Amit_Kumar_Part_01 (1).pptxEngagement_DataBricks_Amit_Kumar_Part_01 (1).pptx
Engagement_DataBricks_Amit_Kumar_Part_01 (1).pptx
sasuke20y4sh
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
DeepaThirumurugan
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Databricks
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
MaheshPandit16
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Databricks
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
Edureka!
 
Deep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and sparkDeep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and spark
François Garillot
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Databricks
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsfPyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
Advanced deeplearning4j features
Advanced deeplearning4j featuresAdvanced deeplearning4j features
Advanced deeplearning4j features
Adam Gibson
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
DESMOND YUEN
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Databricks
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
Jyotasana Bharti
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
Xuan-Chao Huang
 
Engagement_DataBricks_Amit_Kumar_Part_01 (1).pptx
Engagement_DataBricks_Amit_Kumar_Part_01 (1).pptxEngagement_DataBricks_Amit_Kumar_Part_01 (1).pptx
Engagement_DataBricks_Amit_Kumar_Part_01 (1).pptx
sasuke20y4sh
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Ad

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Ad

Recently uploaded (20)

Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 

Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it the Right Way?

  • 1. Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it the Right Way?
  • 2. Hello! I am Guglielmo Iozzia Associate Director – Business Tech Analysis at Previuosly at 2
  • 3. MSD in Ireland + 50 years Approx. 2,000 employees Five sites: Ballydine, Brinny, Carlow and Dublin $2.5 billion investment to date Approx 50% MSD’s top 20 products manufactured here Export to + 60 countries €6.1 billion turnover in 2017 2017 + 300 jobs & €280m investment MSD Biotech, Dublin, coming in 2021
  • 4. Deep Learning It is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. 4
  • 5. Some Practical Applications of Deep Learning × Computer vision × Text generation × NLP and NLU × Autonomous cars × Robotics × Gaming × Quantitative finance 5
  • 6. DL4J It is an Open Source, distributed, Deep Learning framework written for JVM languages. 6
  • 7. It is integrated with Hadoop and Apache Spark and can be used on distributed GPUs and CPUs. 7
  • 8. DL4J Modules × DataVec × Arbiter × NN × Datasets × RL4J × DL4J-Spark × ND4J 8
  • 9. DL4J Code Example 9 Training and Evaluation Network Configuration
  • 10. ND4J It is an Open Source linear algebra and matrix manipulation library which supports n-dimensional arrays and it is integrated with Apache Hadoop and Spark. 10
  • 11. Apache Spark It is a unified analytics engine for large-scale data processing. 11
  • 12. Speed Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. Apache Spark Ease of Use Write applications quickly in Java, Scala, Python, R and SQL. 12
  • 13. Generality Spark provides a stack of libraries that can be combined seamlessly in the same application. Apache Spark Runs Everywhere Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone or in the cloud. It can access diverse data sources. 13
  • 14. Why Distributed MNN Training with DL4J and Apache Spark? Why this is a powerful combination? 14
  • 15. DL4J + Apache Spark × DL4J provides high level API to design, configure train and evaluate MNNs. × Spark performances are excellent in particular for ETL/streaming, but in terms of computation, in a MNN training context, some data transformation/aggregation need to be done using a low-level language. × DL4J uses ND4J, which is a C++ library that provides high level Scala API to developers. 15
  • 16. Model Parallelization DL4J + Apache Spark Data Parallelization 16
  • 17. So: What could possibly go wrong? 17
  • 18. Memory Management And now, for something (a little bit) different. 18
  • 19. Memory Utilization at Training Time 19
  • 20. Memory Management in DL4J Memory allocations can be managed using two different approaches: ×JVM GC and WeakReference tracking ×MemoryWorkspaces The idea behind both is the same: once a NDArray is no longer required, the off-heap memory associated with it should be released so that it can be reused. 20
  • 21. Memory Management in DL4J The difference between the two approaches is: ×JVM GC: when a INDArray is collected by the garbage collector, its off-heap memory is deallocated, with the assumption that it is not used elsewhere. ×MemoryWorkspaces: when a INDArray leaves the workspace scope, its off-heap memory may be reused, without deallocation and reallocation. 21
  • 22. Memory Management in DL4J Please remember that, when a training process uses workspaces, in order to get the most from this approach, periodic GC calls need to be disabled: Nd4j.getMemoryManager.togglePeriodicGc(false) or their frequency needs to be reduced: val gcInterval = 10000 // In milliseconds Nd4j.getMemoryManager.setAutoGcWindow(gcInterval) 22
  • 23. Spark & the DL4J Web UI A love/hate relationship. 23
  • 25. Root Cause and Potential Solutions Dependencies conflict between the DL4J-UI library and Apache Spark when running in the same JVM. Two alternatives are available: ×Collect and save the relevant training stats at runtime, and then visualize them offline later. ×Execute the UI and use its remote functionality into separate JVMs (servers). Metrics are uploaded from the Spark master to the UI server. 25
  • 26. Serialization & ND4J Kryo me a river. 26
  • 27. Data Serialization Options in Spark Data Serialization is the process of converting the in- memory objects to another format that can be used to store or send them over the network. Two options available in Spark: ×Java (default) ×Kryo 27
  • 28. OOOps!!! Kryo doesn’t work well with off-heap data structures. 28
  • 29. How to Use Kryo Serialization with ND4J? ×Add the ND4J-Kryo dependency to the project ×Configure the Spark application to use the ND4J Kryo Registrator: val sparkConf = new SparkConf sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") sparkConf.set("spark.kryo.registrator", "org.nd4j.Nd4jRegistrator") 29
  • 30. Data Locality How data locality affects performance? 30
  • 31. Data Locality in Spark Data locality in Spark means doing computation on the node where data resides. In order to optimize processing tasks, Spark tries to place the execution code as close as possible to the processed data. ×It tries first to move serialized code to the data. ×Sometimes this isn’t possible and the data must be moved to the executor. 31
  • 32. Data Locality in Spark How Spark handles data locality? ×It prefers to schedule all tasks at the best locality level. ×When there is no unprocessed data on any idle executor, it switches to lower locality levels. − It can wait until a busy CPU frees up to start a task on data on the same server. − It can immediately start a new task in a farther place that requires moving data there. 32
  • 33. Data Locality in Spark Spark typically waits a bit in the hopes that a busy CPU frees up. Once that timeout (default is 3 sec) expires, it starts moving the data to the free CPU. But: ×Training neural networks with DL4J is computationally expensive. ×So the Spark default behavior isn’t an ideal fit for maximizing cluster utilization. 33
  • 34. Data Locality in Spark and DL4J During training on Spark, DL4J ensures that there is exactly one task per executor: so it is always better to immediately transfer data to a free executor, rather than waiting for another one to become free. Computation time is more important than any network transfer time. 34
  • 35. Data Locality in Spark and DL4J Spark provides the spark.locality.wait configuration property: it is the timeout (in seconds) to wait before moving data to a free CPU. So, when submitting the configuration for a DL4J training app, we have to set the value of the spark.locality.wait property to 0. 35
  • 36. Handling Java Objects with Large Off-heap Components The Off-heap, again. 36
  • 37. Spark and Large Off-heap Objects Spark has problems handling Java objects with large off- heap components, in particular in caching or persisting them. When working with DL4J, this is a frequent case, as DataSet and NDArray objects are involved. 37
  • 38. Spark and Large Off-heap Objects Spark drops part of a RDD based on the estimated size of that block. It estimates the size of a block depending on the selected persistence level. In case of MEMORY_ONLY or MEMORY_AND_DISK, the estimate is done by walking the Java object graph. This process doesn't take into account the off-heap memory used by DL4J and ND4J, so Spark under- estimates the true size of objects like DataSets or NDArrays. 38
  • 39. Spark and Large Off-heap Objects When deciding bewteen keeping or dropping blocks, Spark considers only the amount of heap memory used. DataSet and NDArray objects have a very small on-heap size, then Spark will keep too many of them, causing out of memory issues as off-heap memory becomes exhausted. 39
  • 40. Spark and Large Off-heap Objects It is then good practice using MEMORY_ONLY_SER or MEMORY_AND_DISK_SER when persisting a RDD<DataSet> or a RDD<INDArray>. This way Spark stores blocks on the JVM heap in serialized form. Because there is no off-heap memory for the serialized objects, it can accurately estimate their size, in so avoiding out of memory issues. 40
  • 41. Image Pipeline Data Preparation One Batch, Two Batch (Penny and Dime) 41
  • 42. Convolutional Neural Network (CNN) 42 By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374
  • 43. Single Machine You can use DataVec’s ImageRecordReader. Image Pipeline Data Preparation Spark Cluster Image preprocessing. 43
  • 44. Image Pipeline Data Preparation The Spark strategy assume the images are in subdirectories based on their class labels. Example: imageRootDir/car/img0.jpg imageRootDir/car/img1.jpg ... imageRootDir/truck/img0.jpg imageRootDir/truck/img1.jpg ... imageRootDir/motorbike/img0.jpg imageRootDir/motorbike/img1.jpg ... 44
  • 45. Image Pipeline Data Preparation (Spark) The approach is to preprocess the images into batches of files (ND4J’s FileBatch objects). The motivation: the original image files typically use efficient compression (JPEG, PNG, other) which is much more space and network efficient than a bitmap representation. However, on a cluster we want to minimize disk reads due to latency issues with remote storage – one single file read/transfer is faster than multiple remote file reads. 45
  • 46. Image Pipeline Data Preparation (Spark) Step 1 (option 1): Preprocess images locally. val sourceDirectory = "/home/guglielmo/training_images" val destinationDirectory = "/home/guglielmo/preprocessed_images" val batchSize = 32 SparkDataUtils.createFileBatchesLocal(sourceDirectory, NativeImageLoader.ALLOWED_FORMATS, true, destinationDirectory, batchSize) After the preprocessing completes, the destination directory can be copied to the cluster. 46
  • 47. Image Pipeline Data Preparation (Spark) Step 1 (option 2): Preprocess images using Spark. val sourceDirectory = “hdfs:///data/training_images” val destinationDirectory = “hdfs:///data/preprocessed_images” val batchSize = 32 SparkDataUtils.createFileBatchesSpark(sourceDirectory, destinationDirectory, batchSize, sparkContext) 47
  • 48. Image Pipeline Data Preparation (Spark) Step 2: Create a data loader... val imageHeightWidth = 64 val imageChannels = 3 val labelMaker:PathLabelGenerator = new ParentPathLabelGenerator val rr = new ImageRecordReader(imageHeightWidth, imageHeightWidth, imageChannels, labelMaker) rr.setLabels(new TinyImageNetDataSetIterator(1).getLabels) val numClasses = TinyImageNetFetcher.NUM_LABELS val loader = new RecordReaderFileBatchLoader(rr, minibatch, 1, numClasses) loader.setPreProcessor(new ImagePreProcessingScaler) 48
  • 49. Image Pipeline Data Preparation (Spark) Step 2: ...and finally train the model val trainDataPath = "hdfs:///data/preprocessed_images" val pathsTrain:JavaRDD<String> = SparkUtils.listPaths(sparkContext, trainDataPath) for (i <- 0 until numEpochs) { sparkNet.fitPaths(pathsTrain, loader) } 49
  • 50. All the Details on DL4J and Spark in my Book http://tinyurl.com/y9jkvtuy 50
  • 51. Thanks! Any questions? You can find me at @guglielmoiozzia https://ie.linkedin.com/in/giozzia googlielmo.blogspot.com 51
  • 52. Credits Special thanks to all the people who made and released these awesome resources for free: ×Presentation template by SlidesCarnival 52

Editor's Notes

  • #5: The field of artificial is essentially when machines can do tasks that typically require human intelligence. It encompasses machine learning, where machines can learn by experience and acquire skills without human involvement. Deep learning is a subset of machintelligenceine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. Similarly to how we learn from experience, the deep learning algorithm would perform a task repeatedly, each time tweaking it a little to improve the outcome. We refer to ‘deep learning’ because the neural networks have various (deep) layers that enable learning. Just about any problem that requires “thought” to figure out is a problem deep learning can learn to solve.
  • #6: http://www.yaronhadad.com/deep-learning-most-amazing-applications/