Peter Andreas Entschev
Senior System Software Engineer – NVIDIA
EuroPython, 10 July 2019
Distributed Multi-GPU Computing with
Dask, CuPy and RAPIDS
2
Outline
• Interoperability / Flexibility
• Acceleration (Scaling Up)
• Distribution (Scaling Out)
3
Clustering
Code Example
from sklearn.datasets import make_moons
import pandas
X, y = make_moons(n_samples=int(1e2),
                  noise=0.05, random_state=0)
X = pandas.DataFrame({'fea%d' % i: X[:, i]
                      for i in range(X.shape[1])})
Find Clusters
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
# scikit-learn's DBSCAN has no predict(); fit_predict() fits and returns the labels
y_hat = dbscan.fit_predict(X)
4
GPU-Accelerated Clustering
Code Example
from sklearn.datasets import make_moons
import cudf
X, y = make_moons(n_samples=int(1e2),
                  noise=0.05, random_state=0)
X = cudf.DataFrame({'fea%d' % i: X[:, i]
                    for i in range(X.shape[1])})
Find Clusters
from cuml import DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
# as with scikit-learn, use fit_predict() to fit and get the cluster labels
y_hat = dbscan.fit_predict(X)
5
What is RAPIDS?
• Suite of open source, end-to-end data science tools
• Built on CUDA
• Unifying framework for GPU data science
• Pandas-like API for data preparation
• Scikit-learn-like API for machine learning
New GPU-Accelerated Data Science Pipeline
6
RAPIDS
End-to-End GPU-Accelerated Data Science
[Diagram: Data Preparation, Model Training and Visualization stages over shared GPU Memory]
• cuDF, cuIO: Analytics
• cuML: Machine Learning
• cuGraph: Graph Analytics
• PyTorch, Chainer, MXNet: Deep Learning
• cuXfilter <> Kepler.gl: Visualization
7
Learning from Apache Arrow
From Apache Arrow Home Page - https://arrow.apache.org/
8
Data Science Workflow with RAPIDS
Open Source, GPU-Accelerated ML Built on CUDA
[Diagram: DATA -> data preparation / wrangling (cuDF) -> ML model training (cuML) -> dataset exploration (VISUALIZE) -> PREDICTIONS]
9
Ecosystem Partners
10
ML Technology Stack
Layers (top to bottom): Python, Cython, cuML Algorithms, cuML Prims, CUDA Libraries, CUDA
Python libraries: Dask cuML, Dask cuDF, cuDF, CuPy, NumPy
CUDA libraries: Thrust, CUB, nvGraph, cuBLAS, cuRAND, cuSOLVER, cuSPARSE, CUTLASS
11
High-Level APIs
Python: Dask Multi-GPU ML
CUDA/C++: ML Algorithms, ML Primitives, Multi-Node / Multi-GPU Communication
Data Parallelism, Model Parallelism
[Diagram: Host 1 and Host 2, each with GPU1, GPU2, GPU3 and GPU4]
12
UMAP
Dimensionality reduction technique now on GPU
https://ai.googleblog.com/2019/03/exploring-neural-networks.html
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualization similarly to t-SNE, but also for general non-linear dimension reduction.
• Fast
• General-purpose dimension reduction
• Scales beyond what most t-SNE packages can manage
• Often preserves global structure better than t-SNE
• Supports a wide variety of distance functions
• Supports adding new points to an existing embedding via the standard scikit-learn transform method
• Supports supervised and semi-supervised dimension reduction
• Has solid theoretical foundations in manifold learning
https://arxiv.org/pdf/1802.03426.pdf
13
UMAP
GPU vs CPU
GPU: 10.5 seconds CPU: 100 seconds
14
Dask
What is Dask and why does RAPIDS use it for scaling out?
• Distributed compute scheduler built to scale Python
• Scales workloads from laptops to supercomputer clusters
• Extremely modular: scheduling, compute, data transfer and out-of-core handling are decoupled
• Multiple workers per node make a one-worker-per-GPU model easier
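The scheduling idea can be sketched without Dask itself. Dask describes work as a task graph: a dict that maps keys to data or to tuples of the form (callable, arguments...), where an argument may name another key. The toy executor below is only an illustration under that assumption: `run_graph` is a hypothetical helper, it runs serially in one process, and it is not Dask's scheduler.

```python
from operator import add, mul


def run_graph(graph, key, cache=None):
    """Evaluate `key` in a Dask-style task graph (toy, serial, single process)."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Arguments that name other keys are computed first.
        resolved = [
            run_graph(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task  # plain data, nothing to compute
    cache[key] = result
    return result


graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),  # z = x + y
    "w": (mul, "z", 10),   # w = z * 10
}

print(run_graph(graph, "w"))  # prints 30
```

A real scheduler walks the same kind of graph, but decides which worker runs each task and moves intermediate results between workers.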
15
Distributing Dask
Distributed array from many arrays
NumPy
Array
Dask
Array
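The "one array from many arrays" idea can be sketched with the standard library alone. In this illustration plain Python lists stand in for the NumPy chunks and a thread pool stands in for Dask workers; `split_into_chunks` and `chunked_sum` are hypothetical names, not Dask functions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Cut one large sequence into consecutive chunks of at most `chunk_size` items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def chunked_sum(chunks, max_workers=4):
    """Reduce each chunk independently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


data = list(range(1_000))
chunks = split_into_chunks(data, chunk_size=250)
chunk_total = chunked_sum(chunks)

print(len(chunks), chunk_total)  # prints 4 499500
```

The per-chunk step never needs the whole array at once, which is what lets the chunks live on different workers, or on different GPUs when the chunks are GPU arrays.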
16
Combine Dask with CuPy
Distributed GPU array from many GPU arrays
GPU
Array
Dask
Array
17
NumPy Array Function (NEP-18)
Interoperability of NumPy-like Libraries
• Function dispatch mechanism
• Allows using NumPy as a high-level API
• NumPy-like arrays need only to implement
__array_function__
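The dispatch mechanism can be modelled in a few lines of plain Python. This is a simplified sketch and not NumPy's implementation: the `dispatchable` decorator, the `total` function and the `ScaledArray` class are hypothetical, and only positional arguments are inspected. What it keeps from the protocol is the `__array_function__(func, types, args, kwargs)` signature and the `NotImplemented` fallback.

```python
import functools


def dispatchable(func):
    """Toy decorator (hypothetical name) that mimics NEP-18 style dispatch."""

    @functools.wraps(func)
    def public_api(*args, **kwargs):
        # Positional arguments whose type opts in to the protocol.
        overloaded = [a for a in args if hasattr(type(a), "__array_function__")]
        # Unique argument types, in first-seen order.
        types = tuple(dict.fromkeys(type(a) for a in overloaded))
        for arg in overloaded:
            result = arg.__array_function__(public_api, types, args, kwargs)
            if result is not NotImplemented:
                return result
        if overloaded:
            raise TypeError(f"no implementation of {func.__name__} for {types}")
        # No argument overrides the function: run the default implementation.
        return func(*args, **kwargs)

    return public_api


@dispatchable
def total(values):
    """Default implementation, used for plain Python sequences."""
    return sum(values)


class ScaledArray:
    """Hypothetical array-like that supplies its own `total`."""

    def __init__(self, data, scale):
        self.data = list(data)
        self.scale = scale

    def __array_function__(self, func, types, args, kwargs):
        if func is total:
            return self.scale * sum(self.data)
        return NotImplemented  # defer for functions we do not handle


print(total([1, 2, 3]))                         # prints 6
print(total(ScaledArray([1, 2, 3], scale=10)))  # prints 60
```

The caller writes `total(...)` in both cases; which implementation runs depends only on the argument type, which is how a library such as CuPy or Dask can sit behind the NumPy API.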
18
Dask SVD Example
Interoperability of NumPy-like Libraries
In [1]: import dask, dask.array
...: import numpy
In [2]: x = numpy.random.random((1000000, 1000))
...: dx = dask.array.from_array(x, chunks=(10000, 1000), asarray=False)
In [3]: u, s, v = numpy.linalg.svd(dx)
In [4]: %%time
...: u, s, v = dask.compute(u, s, v)
CPU times: user 39min 4s, sys: 47min 31s, total: 1h 26min 35s
Wall time: 1min 21s
19
Dask+CuPy SVD Example
Interoperability of NumPy-like Libraries
In [1]: import dask, dask.array
...: import numpy
...: import cupy
In [2]: x = cupy.random.random((1000000, 1000))
...: dx = dask.array.from_array(x, chunks=(10000, 1000), asarray=False)
In [3]: u, s, v = numpy.linalg.svd(dx)
In [4]: %%time
...: u, s, v = dask.compute(u, s, v)
CPU times: user 34.5 s, sys: 17.6 s, total: 52.1 s
Wall time: 41 s
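As a worked comparison of the two timings shown on the slides (the hardware behind them is not stated, so the ratios apply only to these two runs), the reported times can be converted to seconds and divided:

```python
def to_seconds(hours=0, minutes=0, seconds=0.0):
    """Convert an h/min/s reading into seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Wall-clock times reported by %%time on the two slides.
numpy_wall = to_seconds(minutes=1, seconds=21)             # Dask + NumPy
cupy_wall = to_seconds(seconds=41)                         # Dask + CuPy

# Total host CPU time (user + sys) reported on the two slides.
numpy_cpu = to_seconds(hours=1, minutes=26, seconds=35)
cupy_cpu = to_seconds(seconds=52.1)

wall_speedup = numpy_wall / cupy_wall
cpu_time_ratio = numpy_cpu / cupy_cpu

print(f"wall-clock speedup: {wall_speedup:.2f}x")    # prints 1.98x
print(f"host CPU time ratio: {cpu_time_ratio:.1f}x")  # prints 99.7x
```

The wall-clock gain is about 2x, while the host CPUs do roughly 100x less work in the CuPy run, since the computation itself has moved to the GPU.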
20
NumPy Array Function (NEP-18)
Protocol Limitations
• Universal functions – __array_ufunc__ already addresses those
• numpy.array() and numpy.asarray() – will require their own protocol
• Dispatch for methods of any kind – e.g., numpy.random.RandomState()
21
uarray
Alternative to __array_function__
• Generic multiple-dispatch mechanism
• Intended to address shortcomings of NEP-18
• https://uarray.readthedocs.io/
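The backend-selection idea can be illustrated with a toy model in plain Python. Everything here is an assumption made for illustration: `set_backend`, `ones`, `ListBackend` and `TupleBackend` are hypothetical names that only imitate the shape of the API, and real uarray does considerably more (for example, dispatching on argument types).

```python
import contextlib
import contextvars

# The currently active backend for this context (None means "not set").
_backend = contextvars.ContextVar("backend", default=None)


@contextlib.contextmanager
def set_backend(backend):
    """Make `backend` active inside the `with` block, then restore the previous one."""
    token = _backend.set(backend)
    try:
        yield
    finally:
        _backend.reset(token)


def ones(n):
    """Backend-agnostic function: forwards to whichever backend is active."""
    backend = _backend.get()
    if backend is None:
        raise RuntimeError("no backend set")
    return backend.ones(n)


class ListBackend:
    @staticmethod
    def ones(n):
        return [1.0] * n


class TupleBackend:
    @staticmethod
    def ones(n):
        return (1.0,) * n


with set_backend(ListBackend):
    a = ones(3)
    with set_backend(TupleBackend):  # the innermost backend wins
        b = ones(3)
    c = ones(3)                      # the outer backend is restored

print(type(a).__name__, type(b).__name__, type(c).__name__)  # prints list tuple list
```

Unlike `__array_function__`, the choice of implementation here does not depend on an existing array argument, so creation functions such as `ones` can be dispatched too.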
22
uarray
CuPy Example
In [1]: import uarray as ua
...: import unumpy as np
...: import unumpy.cupy_backend as cupy_backend
In [2]: with ua.set_backend(cupy_backend):
...: a = np.ones((2, 2))
...: print(np.sum(a))
...: print(type(a))
...: print(type(np.sum(a)))
4.0
<class 'cupy.core.core.ndarray'>
<class 'cupy.core.core.ndarray'>
23
uarray
Dask+CuPy Example
In [1]: import uarray as ua
...: import unumpy as np
...: import unumpy.cupy_backend as cupy_backend
...: import unumpy.dask_backend as dask_backend
In [2]: with ua.set_backend(cupy_backend), ua.set_backend(dask_backend):
...: a = np.ones((2, 2))
...: print(np.sum(a).compute())
...: print(type(a))
...: print(type(np.sum(a).compute()))
4.0
<class 'dask.array.core.Array'>
<class 'numpy.float64'> # currently
<class 'cupy.core.core.ndarray'> # expected: Dask will need to support uarray for this to work!
24
Python CUDA Array Interface
Interoperability for Python GPU Array Libraries
• GPU array standard
• Allows sharing GPU arrays between different libraries
• Native ingest and export of __cuda_array_interface__-compatible objects via Numba device arrays in cuDF
• Numba, CuPy, and PyTorch are the first libraries to adopt the interface:
• https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html
• https://github.com/cupy/cupy/releases/tag/v5.0.0b4
• https://github.com/pytorch/pytorch/pull/11984
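A consumer of the interface reads a plain dict from the `__cuda_array_interface__` attribute. The sketch below shows that dict's required keys (`shape`, `typestr`, `data`, `version`) as described in the Numba document linked above; `describe_cuda_array` and `FakeDeviceArray` are hypothetical, and the pointer value is a placeholder rather than real GPU memory, so nothing here touches a device.

```python
REQUIRED_KEYS = ("shape", "typestr", "data", "version")


def describe_cuda_array(obj):
    """Summarize an object that exposes __cuda_array_interface__ (no GPU access)."""
    iface = obj.__cuda_array_interface__
    missing = [key for key in REQUIRED_KEYS if key not in iface]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    pointer, read_only = iface["data"]    # (device pointer as int, read-only flag)
    itemsize = int(iface["typestr"][2:])  # e.g. "<f4" -> 4 bytes per element
    count = 1
    for dim in iface["shape"]:
        count *= dim
    return {"pointer": pointer, "read_only": read_only, "nbytes": count * itemsize}


class FakeDeviceArray:
    """Hypothetical stand-in for a GPU array; the pointer is a placeholder."""

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": (2, 3),
            "typestr": "<f4",             # little-endian 32-bit float
            "data": (0xDEAD0000, False),  # placeholder address, writable
            "version": 2,
        }


info = describe_cuda_array(FakeDeviceArray())
print(info["nbytes"], info["read_only"])  # prints 24 False
```

Because only a pointer and its metadata are exchanged, a consuming library can wrap the same device memory without copying it.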
25
Interoperability for the Win
DLPack and __cuda_array_interface__
26
Challenges: Communication
OpenUCX
• TCP sockets are slow!
• UCX provides uniform access to transports (TCP, InfiniBand, shared memory, NVLink)
• Python bindings for UCX (ucx-py) in the works: https://github.com/rapidsai/ucx-py
• Will give Dask the best communication performance that the hardware available on the nodes/cluster allows
27
Challenges: Communication
OpenUCX Performance – Before and After
28
Benchmark: single-GPU CuPy vs NumPy
More details: https://blog.dask.org/2019/06/27/single-gpu-cupy-benchmarks
29
Benchmarks: single-GPU cuML vs scikit-learn
30
SVD Benchmark
31
Scale up with RAPIDS
Axis: Scale Up / Accelerate
PyData: NumPy, Pandas, Scikit-Learn, Numba and many more. Single CPU core, in-memory data.
RAPIDS and Others: accelerated on a single GPU.
• NumPy -> CuPy/PyTorch/..
• Pandas -> cuDF
• Scikit-Learn -> cuML
• Numba -> Numba
32
Scale up and out with RAPIDS and Dask
Axes: Scale Up / Accelerate and Scale Out / Parallelize
PyData: NumPy, Pandas, Scikit-Learn, Numba and many more. Single CPU core, in-memory data.
RAPIDS and Others: accelerated on a single GPU.
• NumPy -> CuPy/PyTorch/..
• Pandas -> cuDF
• Scikit-Learn -> cuML
• Numba -> Numba
Dask: multi-core and distributed PyData.
• NumPy -> Dask Array
• Pandas -> Dask DataFrame
• Scikit-Learn -> Dask-ML
• … -> Dask Futures
Dask + RAPIDS: multi-GPU, on a single node (DGX) or across a cluster.
33
Road to 1.0
October 2018 - RAPIDS 0.1
cuML Single-GPU Multi-GPU Multi-Node-Multi-GPU
Gradient Boosted Decision Trees (GBDT)
GLM
Logistic Regression
Random Forest (regression)
K-Means
K-NN
DBSCAN
UMAP
ARIMA
Kalman Filter
Holt-Winters
Principal Components
Singular Value Decomposition
34
Road to 1.0
June 2019 - RAPIDS 0.8
cuML Single-GPU Multi-GPU Multi-Node-Multi-GPU
Gradient Boosted Decision Trees (GBDT)
GLM
Logistic Regression
Random Forest (regression)
K-Means
K-NN
DBSCAN
UMAP
ARIMA
Kalman Filter
Holt-Winters
Principal Components
Singular Value Decomposition
35
Road to 1.0
Q4 - 2019 - RAPIDS 0.12?
cuML Single-GPU Multi-GPU Multi-Node-Multi-GPU
Gradient Boosted Decision Trees (GBDT)
GLM
Logistic Regression
Random Forest (regression)
K-Means
K-NN
DBSCAN
UMAP
ARIMA
Kalman Filter
Holt-Winters
Principal Components
Singular Value Decomposition
36
Road to 1.0
Focused on robust functionality, deployment, and user experience
Integration with every major cloud provider
Both containers and cloud specific machine instances
Support for Enterprise and HPC Orchestration Layers
37
RAPIDS
How do I get the software?
• https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai
• https://hub.docker.com/r/rapidsai/rapidsai/
• https://github.com/rapidsai
• https://anaconda.org/rapidsai/
38
Additional Reading Material
• Python, Performance and GPUs (Matthew Rocklin): https://towardsdatascience.com/python-performance-and-gpus-1be860ffd58d?ncid=so-twi-n2-96487&linkId=100000006881312
• NEP-18: A Dispatch Mechanism for NumPy’s high level array functions (Stephan Hoyer, et al.): https://www.numpy.org/neps/nep-0018-array-function-protocol.html
• uarray update: API changes, overhead and comparison to __array_function__ (Hameer Abbasi): https://labs.quansight.org/blog/2019/07/uarray-update-api-changes-overhead-and-comparison-to-__array_function__/
THANK YOU
Peter Andreas Entschev
pentschev@nvidia.com
@PeterEntschev
