Deep Learning Pipelines
@joerg_schad @dcos
© 2018 Mesosphere, Inc. All Rights Reserved.
Jörg Schad
Distributed Systems Engineer
Deep Learning: The Promise
Deep Learning: The Process
Step 1: Training (In Data Center - Over Hours/Days/Weeks)
● Input: Lots of Labeled Data (e.g. labeled “Dog”)
● Deep neural network model
● Output: Trained Model
Step 2: Inference (Endpoint or Data Center - Instantaneous)
● Input: New Input from Camera or Sensor
● Trained Model
● Output: Classification (97% Dog, 3% Panda)
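The classification output on this slide is a probability distribution over labels. As a minimal sketch, assuming the model ends in a softmax layer (the talk does not say so), raw scores can be turned into such percentages as follows. The logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["dog", "panda"]
logits = [4.0, 0.5]  # hypothetical raw scores from the trained model

probs = softmax(logits)
prediction = labels[probs.index(max(probs))]

for label, p in zip(labels, probs):
    print(f"{label}: {p:.0%}")
```

The inference step returns the label with the highest probability, here stored in `prediction`.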
Deep Learning: Some insight
Deep Learning: The Challenges
Pipeline components: Input Data, Frameworks, Cluster (+ state), Models, Model Serving, Monitoring, Users
Training Challenges
● Compute Intensive
○ (Hopefully) Large Datasets
■ Train
■ Dev
■ Test
○ Hyperparameters
■ Number of Layers
■ Number of Units per Layer
■ Learning Rate
■ …
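Both items multiply the compute cost: every hyperparameter combination means another training run over the training set. The sketch below illustrates this with a train/dev/test split and a grid over the three hyperparameters named above. The split ratios, the search-space values, and the helper names `split_dataset` and `grid` are assumptions for illustration, not taken from the talk.

```python
import itertools
import random


def split_dataset(examples, train=0.8, dev=0.1, seed=0):
    """Shuffle examples and split them into train/dev/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],  # whatever remains is the test set
    )


def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


# Hypothetical search space over the hyperparameters listed on the slide.
search_space = {
    "num_layers": [2, 4],
    "units_per_layer": [64, 128, 256],
    "learning_rate": [0.1, 0.01],
}

train_set, dev_set, test_set = split_dataset(range(1000))
configs = list(grid(search_space))
print(len(train_set), len(dev_set), len(test_set), len(configs))
```

Even this small grid needs 2 × 3 × 2 = 12 separate training runs.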
Input Data Management
Challenges
● Training/Dev/Test + New Data
● Large amounts
● Quality
● Availability (for cluster)
● Velocity
● Streaming
Solutions
● GFS
● Apache Kafka
● Apache Cassandra
Deep Learning Frameworks
What is TensorFlow?
“An open-source software library for Machine Intelligence” -
tensorflow.org
● Machine Intelligence is the broad term used to describe
techniques allowing computers to “learn” by analyzing very
large data sets using artificial neural networks
● TensorFlow is a software library that makes it easy for
developers to construct artificial neural networks to analyze
their data of interest
Stack (from the slide diagram): Python; TensorFlow Library (Dataflow Executor, Compute Kernel Implementations, Networking, etc.); CPUs and GPUs
Alternatives
Data Analytics Ecosystem
Challenges
● Different Frameworks
● No single framework rules them all
Solutions
● Choice
● Deployments?
● Models across Frameworks?
Users
Challenges
● Different Users/Use cases
● Data Analyst/Exploring
● Production Workloads
● Highly Optimized
● How to spawn Environments?
Solutions
Cluster Management and Deployments
Datacenter
● Typical Datacenter: siloed, over-provisioned servers, low utilization
● Mesos / DC/OS: automated schedulers, workload multiplexing onto the
same machines (e.g. TensorFlow, Jenkins, Kafka, Spark)
What is DC/OS?
● DC/OS (Data Center Operating System) is an
open-source, distributed operating system
● It takes Mesos and builds upon it with
additional services and functionality
○ Built-in support for service discovery, load balancing, security, and
ease of installation
○ Extra tooling (e.g. comprehensive CLI and a GUI)
○ Built-in frameworks for launching long running services (Marathon)
and batch jobs (Metronome)
○ A repository (app-store) for installing other common packages and
frameworks (e.g. Spark, Kafka, Cassandra, TensorFlow)
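To make the Marathon point concrete, here is a rough sketch of an app definition for a long-running service, built as a Python dict and serialized to JSON. The field names (`id`, `cmd`, `cpus`, `mem`, `instances`, `container`) follow Marathon's app definition format. The app id, command, resource sizes, and image choice are hypothetical and not from the talk.

```python
import json

# Hypothetical service: id, command and sizes are illustrative only.
app_definition = {
    "id": "/tensorflow-worker",
    "cmd": "python /app/train.py",  # assumed entry point inside the image
    "cpus": 2,                      # CPU shares per instance
    "mem": 4096,                    # memory per instance, in megabytes
    "instances": 3,                 # Marathon keeps this many copies running
    "container": {
        "type": "DOCKER",
        "docker": {"image": "tensorflow/tensorflow"},
    },
}

marathon_json = json.dumps(app_definition, indent=2)
print(marathon_json)
```

A file with this content could then be submitted with `dcos marathon app add`, after which Marathon restarts instances that fail.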
Typical Developer Workflow for TensorFlow (Single-Node)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for single-node performance
● Train your model on a single node: Input Data Set → Trained Model
Typical Developer Workflow for TensorFlow (Distributed)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for distributed computation
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address
of the machine where those computations will be performed
● Deploy your code on every machine
● Train your model on the cluster: Input Data Set → Trained Model
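The IP-mapping step is where most of the manual work sits. The sketch below shows what such a hard-coded mapping can look like. The dict of job names to address lists has the shape that `tf.train.ClusterSpec` accepts, and the per-task dict mirrors the `TF_CONFIG` layout. The addresses and the helper `task_config` are hypothetical, and the code deliberately avoids importing TensorFlow so it runs anywhere.

```python
import json

# Hypothetical machine addresses; in a hand-rolled setup these are hard-coded.
cluster = {
    "ps": ["10.0.0.1:2222"],
    "worker": ["10.0.0.2:2222", "10.0.0.3:2222"],
}


def task_config(cluster, job, index):
    """Describe one task's place in the cluster, in the style of TF_CONFIG."""
    if not 0 <= index < len(cluster[job]):
        raise IndexError(f"no task {index} in job {job!r}")
    return {"cluster": cluster, "task": {"type": job, "index": index}}


# Every machine needs the full cluster description plus its own identity.
for job, addresses in cluster.items():
    for index, address in enumerate(addresses):
        config = task_config(cluster, job, index)
        print(address, json.dumps(config["task"]))
```

Each of the three machines must receive the same cluster description and its own task entry, which is why deployment is a per-machine step.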
Challenges running distributed TensorFlow
● Dealing with failures is not graceful
○ Users need to stop training, change their hard-coded ClusterSpec, and
manually restart their jobs
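A small sketch of why this is painful, assuming a hard-coded cluster description like the one used in a manual setup: when one machine is replaced, the address list changes, and every task that holds a copy of it has to be restarted with the new one. All addresses and the helper name `replace_failed_task` are hypothetical.

```python
import copy

# Hypothetical hard-coded cluster description (job name -> task addresses).
cluster = {
    "ps": ["10.0.0.1:2222"],
    "worker": ["10.0.0.2:2222", "10.0.0.3:2222"],
}


def replace_failed_task(cluster, job, index, new_address):
    """Return a new cluster description with one task moved to a new address."""
    updated = copy.deepcopy(cluster)
    updated[job][index] = new_address
    return updated


# Worker 1 dies and its replacement comes up on a different machine.
updated = replace_failed_task(cluster, "worker", 1, "10.0.0.9:2222")

# Every task holds a copy of the cluster description, so all of them
# have to be stopped and restarted with the updated one.
tasks_to_restart = [
    (job, index)
    for job, addresses in updated.items()
    for index in range(len(addresses))
]
print(updated["worker"], tasks_to_restart)
```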
● Manually configuring each node in a cluster takes a long time and is error-prone
○ Setting up access to a shared file system (for checkpoint and summary files)
requires authenticating on each node
○ Tweaking hyper-parameters requires re-uploading code to every node
Running distributed TensorFlow on DC/OS
● The dcos-commons SDK cleanly restarts failed tasks and reconnects
them to the cluster
Model Management
Many Models
Challenges
● Many Models
● Different Hyperparameters
● Different Models
● New Training Data
● ...
Solutions
● Persistent Storage + Metadata
● GFS
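One way to read "Persistent Storage + Metadata" is that each stored model artifact gets a record describing how it was produced. The sketch below shows what such a record could contain. The field names, the in-memory `registry` dict, and the sample values are all assumptions for illustration. A real setup would keep the artifact on persistent storage and the records in a database.

```python
import hashlib
import time


def model_record(name, version, hyperparameters, training_data, artifact_bytes):
    """Build a metadata record that ties a stored model to how it was trained."""
    return {
        "name": name,
        "version": version,
        "hyperparameters": hyperparameters,
        "training_data": training_data,  # e.g. a dataset snapshot identifier
        "artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(),
        "created_at": time.time(),
    }


registry = {}  # stand-in for a persistent metadata store


def register(record):
    """Store a record under (name, version), refusing silent overwrites."""
    key = (record["name"], record["version"])
    if key in registry:
        raise ValueError(f"{key} is already registered")
    registry[key] = record


# Hypothetical model trained with one hyperparameter configuration.
register(model_record(
    name="dog-classifier",
    version=1,
    hyperparameters={"num_layers": 4, "learning_rate": 0.01},
    training_data="images-snapshot-001",
    artifact_bytes=b"serialized model weights would go here",
))
print(sorted(registry))
```

With records like this, two models that differ only in hyperparameters or training data stay distinguishable after the training job has finished.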
Model Serving
Challenges
● How to Deploy Models?
● Zero Downtime
● Canary
● ...
Solutions
● TensorFlow Serving
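The "Canary" item refers to sending a small share of traffic to a new model version before switching over completely. A minimal sketch of the routing decision is shown below. The 5% fraction, the version names, and the helper `pick_version` are hypothetical, and the code does not use TensorFlow Serving itself.

```python
import random


def pick_version(canary_fraction, rng):
    """Route one request: a small share goes to the canary model version."""
    return "canary" if rng.random() < canary_fraction else "stable"


rng = random.Random(42)  # seeded so the sketch is reproducible
counts = {"stable": 0, "canary": 0}
for _ in range(10_000):
    counts[pick_version(0.05, rng)] += 1

print(counts)
```

If the canary version's error rate and latency look healthy, the fraction can be raised step by step until the old version receives no traffic, which is one way to approach zero-downtime model updates.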
Monitoring
Challenges
● Understand {...}
● Debug
● Model Quality
● Accuracy
● Training Time
● …
● Overall Architecture
● Availability
● Latencies
● ...
Solutions
● TensorBoard
● Traditional Cluster Monitoring Tools
Demo Time
Related Work
● DC/OS TensorFlow
https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
● DC/OS PyTorch
https://mesosphere.com/blog/deep-learning-pytorch-gpus/
● Ted Dunning’s Machine Learning Logistics
https://thenewstack.io/maprs-ted-dunning-intersection-machine-learning-containers/
● KubeFlow
https://github.com/kubeflow/kubeflow
● TensorFlow (+ TensorBoard and Serving)
https://www.tensorflow.org/
Special Thanks to All Collaborators
Ben Wood
Robin Oh
Evan Lezar
Art Rand
Gabriel Hartmann
Sam Pringle
Kevin Klues
Questions and Links
● DC/OS TensorFlow Package (currently closed source)
○ https://github.com/mesosphere/dcos-tensorflow
● DC/OS TensorFlow Tools
○ https://github.com/dcos-labs/dcos-tensorflow-tools/
● Tutorial for deploying TensorFlow on DC/OS
○ https://github.com/dcos/examples/tree/master/tensorflow
● Contact:
○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos
○ Slack: chat.dcos.io #tensorflow

Webinar: Deep Learning Pipelines Beyond the Learning


Editor's Notes

  • #7: It is one thing, as a developer, to build a TensorFlow model on my laptop...
  • #19: https://jupyterhub.readthedocs.io/en/latest/ https://github.com/vigsterkr/marathonspawner https://github.com/twosigma/beakerx
  • #20: https://jupyterhub.readthedocs.io/en/latest/ https://github.com/vigsterkr/marathonspawner
  • #22: Status quo: datacenters are statically partitioned into siloed clusters, each dedicated to running individual datacenter-scale applications. Silos form by data (SQL, HDFS, Cassandra), by services (compute such as Spark and MapReduce, microservices, Docker), by users (by department/team, per-user dev clusters), and by environment (dev/qa/prod).
  • #39: https://www.tensorflow.org/
  • #41: https://www.tensorflow.org/