Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3

Wangda Tan (Hadoop PMC member @Hortonworks)
Sunil Govind (Hadoop PMC member @Hortonworks)
Deep learning on YARN: running TensorFlow, etc. on Hadoop clusters

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
 Machine Learning Basics
 Machine Learning in Production
 How YARN Helps
 Example: Running distributed TensorFlow on YARN

Machine learning basics

Basics: Machine Learning
 Cat Classifier
[Diagram: labeled training data (cats and non-cats) is fed in to train a model; the saved model then predicts on a new picture, e.g. Cat (80%), Non-Cat (20%).]

Basics: Model Training
[Diagram: model lifecycle loop: Model Training → Model Evaluation → Model Validation → Model Staging → Model Training]
 Traditional machine learning models
– Logistic Regression
– Gradient boosting tree
– Recommendation/ALS
– LDA
 Libraries
– Apache Spark MLlib
– XGBoost
 Deep learning models
– DNN
– CNN
– RNN
– LSTM
 Libraries
– TensorFlow
– Apache MXNet
– PyTorch

Basics: Why GPU?
 GPU: many cores that handle a massive number of simple computation tasks simultaneously.
[Chart comparing GPU and CPU; legend: GPU, Computation Intensive, Other]
 Without GPU support, jobs run so long that researchers/engineers can hardly wait for them to finish.

Machine learning in production

Machine Learning in a tutorial
$ nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
Go to http://localhost:8888/ in your browser

Machine Learning in a Unified Platform
“Hidden Technical Debt in Machine Learning Systems”, Google

Training Hierarchical Models
Word Embedding Model
Food picture classifier Model
Ensemble Model
"Burger is great. however onion rings were over cooked" (Image/Photo from Yelp)

Data pipelines for Machine Learning (Big Data)
 ETL
 Data Exploration
 Join / Sampling / Feature Extraction
 Split train, test data sets, etc.

How YARN helps

Running on YARN: All about data and sharing
Hadoop YARN
HDFS AWS S3 RDBMS
Spark MLlib XGBoost TensorFlow
Zeppelin / Jupyter
Hive/LLAP Spark SQL
CPU GPU SSD

Why all under YARN
[Diagram: what a normal cluster user asks for, and what YARN provides]
 SLA: Capacity Planning, Preemption, Reservation System.
 Monitoring: Timeline Service, Grafana, etc.
 Quotas: Queues / user quotas, user access control.
 Isolation: CPU / Memory / GPU / FPGA, (WIP) Network/Disk.

All running on the same YARN platform
[Diagram: cluster nodes with 128 G of memory each, some running LLAP and some equipped with GPUs, all managed by the same YARN platform]

Recent work in YARN to support ML workloads such as TensorFlow
 GPU isolation/scheduling support
 Native Service: easy to define and run any custom service
 All of the above is available in Apache Hadoop 3.1.0

GPU support on YARN (Apache Hadoop 3.1.0)
 Why is isolation needed?
– Multiple processes sharing a single GPU will:
• Be serialized.
• Easily cause OOM.
 GPU isolation on YARN:
– Granularity is per GPU device.
– Uses cgroups / Docker to enforce the isolation.

Docker + GPU support on YARN (Apache Hadoop 3.1.0)
 Most machine learning platforms have Python/R/cuDNN/CUDA dependencies.
 Docker solves the messy dependency issues
– But it may introduce problems for GPU base libraries
 nvidia-docker-plugin mounts the Nvidia driver, etc. when the container is launched.
 YARN supports Docker as well as nvidia-docker-plugin.
[Diagram, two cases. With volume mount: a container (TensorFlow 1.2, CUDA Library 5.0, Ubuntu 14.04) gets GPU Base Lib v1 mounted from the host OS, so it matches the host. Without: the same container bundles GPU Base Lib v2 while the host OS has GPU Base Lib v1, and it fails.]
Running Distributed TensorFlow on YARN

Why distributed?
Reference: https://www.tensorflow.org/performance/benchmarks

How Does Distributed TF Work?
 Distributed TF architecture
 How to make it work?
– Set the following environment variable: TF_CONFIG
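
As a rough illustration of what goes into TF_CONFIG: TensorFlow reads it as a JSON string that describes the whole cluster (a list of host:port addresses per job, such as "ps" and "worker") and the role of the current process. The sketch below builds such a string with the Python standard library only. The helper name `build_tf_config` and all host names and ports are hypothetical, not taken from the slides.

```python
import json
import os


def build_tf_config(ps_hosts, worker_hosts, task_type, task_index):
    """Return a TF_CONFIG JSON string for one task of the cluster."""
    cluster = {"ps": list(ps_hosts), "worker": list(worker_hosts)}
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index out of range")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Hypothetical addresses; on YARN these would be the container host names.
ps = ["ps-0.example.com:2222"]
workers = ["worker-0.example.com:2222", "worker-1.example.com:2222"]

# Every container gets the same cluster section but its own task section.
os.environ["TF_CONFIG"] = build_tf_config(ps, workers, "worker", 1)
print(os.environ["TF_CONFIG"])
```

Each process in the job needs its own value: the cluster part is identical everywhere, while the task type and index differ per container.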

Using YARN to run distributed TensorFlow
 What you need to do:
– Write a YARN service spec with the proper TF_CONFIG in its parameters.
– Run the job by using:
– yarn app -launch ${SERVICE_NAME} ${PATH_TO_SERVICE_SPEC}
 What happens under the hood

Write service spec to run distributed TensorFlow
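
The spec from this slide did not survive in the text, so what follows is only a minimal sketch of what such a spec might look like, generated from Python. It assumes the YARN Native Service JSON layout (`name`, `version`, `components`, `number_of_containers`, `launch_command`, `artifact`, `resource`, `configuration.env`). The Docker image name, the launch command, and the helper names are hypothetical, and the TF_CONFIG value is left as a caller-supplied string because the source does not show how it is filled in per container.

```python
import json


def component(name, count, launch_command, image, env, cpus=1, memory_mb=4096):
    """Describe one component (e.g. "ps" or "worker") of a YARN service."""
    return {
        "name": name,
        "number_of_containers": count,
        "launch_command": launch_command,
        "artifact": {"id": image, "type": "DOCKER"},
        "resource": {"cpus": cpus, "memory": str(memory_mb)},
        "configuration": {"env": dict(env)},
    }


def build_service_spec(service_name, image, num_workers, num_ps, tf_config):
    """Assemble a two-component (ps + worker) service spec as a dict."""
    cmd = "python train.py"           # hypothetical training entry point
    env = {"TF_CONFIG": tf_config}    # placeholder supplied by the caller
    return {
        "name": service_name,
        "version": "1.0.0",
        "components": [
            component("ps", num_ps, cmd, image, env),
            component("worker", num_workers, cmd, image, env),
        ],
    }


spec = build_service_spec(
    "distributed-tf", "example/tensorflow-gpu:latest",  # hypothetical names
    num_workers=2, num_ps=1, tf_config="<TF_CONFIG JSON for this task>",
)
print(json.dumps(spec, indent=2))
```

Saved to a file, a spec like this is what `yarn app -launch ${SERVICE_NAME} ${PATH_TO_SERVICE_SPEC}` takes as its second argument.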

Write service spec to serve TensorFlow model
 Note:
– Uses simple_tensorflow_serving (github.com/tobegit3hub/simple_tensorflow_serving)
– Access the serving REST endpoint at http://serving.serving-job-001.<domain-name>:port
– Still feels complicated? We’re working on a wrapper to simplify this!

Write service spec to run MXNet – Fine-tune model.
 Note:
– Fine-tuning refers to training with parameters partially initialized from a pre-trained model.
– Prepare the caltech256 dataset first, then fine-tune on it starting from imagenet11k-resnet-152.
– YARN Native Service’s dependencies feature runs the prepare component first; once it has completed, the real training starts on the prepared dataset.
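
To make the ordering idea concrete, here is a small sketch of how a `dependencies` list on each component determines start order. It illustrates the semantics only and is not YARN's implementation. The component names `prepare` and `train` and the helper `start_order` are hypothetical.

```python
def start_order(components):
    """Order component names so each one comes after its dependencies."""
    remaining = {c["name"]: set(c.get("dependencies", [])) for c in components}
    order = []
    while remaining:
        started = set(order)
        # A component is ready once everything it depends on has started.
        ready = sorted(n for n, deps in remaining.items() if deps <= started)
        if not ready:
            raise ValueError("cyclic or unknown dependency")
        order.extend(ready)
        for name in ready:
            del remaining[name]
    return order


# Hypothetical components: training waits for dataset preparation.
components = [
    {"name": "train", "dependencies": ["prepare"]},
    {"name": "prepare"},
]
print(start_order(components))  # ['prepare', 'train']
```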

-- Related Session --
Accelerating XGBoost applications with GPU and Spark
Yanbo Liang & Mingjie Tang
2:50 PM, Room I, Wed April 18th
https://dataworkssummit.com/berlin-2018/session/accelerating-xgboost-applications-with-gpu-and-spark/

Demo

Questions?
