HPC Concepts in Data Science
Module 1 Sashikumaar Ganesan, CDS, IISc Bangalore
CDS HPC in Data Science
HPC System
[Figure: HPC system diagram; the only legible label is MPI]
• CUDA
  • general-purpose parallel computing platform and programming model for NVIDIA GPUs
  • allows developers to use C++ as a high-level programming language
• OpenCL/OpenACC/AMD ROCm …
• Python is powerful, but pure Python code does not scale well on its own
Parallel Programming Tools/Packages
Numba
• @jit - compiles a function just in time; works best on code that uses NumPy arrays
and functions, and loops
• @vectorize - produces NumPy ufuncs (with all the ufunc methods supported)
• @cfunc - declares a function for use as a native callback (to be called from C/C++ etc.)
• @overload - registers your own implementation of a function for use in nopython mode
• Numba can target NVIDIA CUDA and (experimentally) AMD ROC GPUs
https://numba.pydata.org/numba-doc/latest/user/5minguide.html
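A minimal sketch of the @jit idea, assuming Numba is installed. The function name sum_of_squares and the problem size are illustrative choices, not taken from the slides. It uses njit, Numba's shorthand for @jit in nopython mode, and a small fallback so the sketch still runs (uncompiled) where Numba is missing.

```python
import time

try:
    from numba import njit  # shorthand for @jit(nopython=True)
except ImportError:
    # Fallback (assumption): without Numba, run the same function uncompiled.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Sum i*i for i in 0..n-1 with an explicit loop, a pattern Numba handles well."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# With Numba, the first call includes compilation; later calls reuse the compiled code.
start = time.perf_counter()
first = sum_of_squares(1_000_000)
first_call = time.perf_counter() - start

start = time.perf_counter()
second = sum_of_squares(1_000_000)
second_call = time.perf_counter() - start

print(first == second)
print(f"first call: {first_call:.4f} s, second call: {second_call:.4f} s")
```

With Numba available, the second call is typically much faster than the first, because compilation happens once per argument type. The actual timings depend on the machine.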
cuDNN
https://developer.nvidia.com/cudnn
Dask
• provides ways to scale pandas, scikit-learn, and NumPy workflows more natively,
with minimal rewriting
• Dask scales out to clusters
• Dask scales down to single computers
• Dask supports complex applications in most cases, but standard big-data
technology such as MapReduce or Spark is used for more complex problems (Module 3)
• A kind of “Hadoop/Spark for Python”
https://docs.dask.org/en/latest/why.html
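The way Dask builds work lazily and then hands it to a scheduler can be sketched with dask.delayed. This is only an illustration under stated assumptions: load and partial_sum are hypothetical stand-ins for reading and reducing one partition of a dataset, and the fallback branch (plain eager evaluation) exists only so the sketch runs where Dask is not installed.

```python
try:
    from dask import compute, delayed
except ImportError:
    # Fallback (assumption): evaluate eagerly and serially if Dask is absent.
    def delayed(func):
        return func

    def compute(*values):
        return values


def load(partition_id):
    """Hypothetical stand-in for reading one partition of a dataset."""
    return list(range(partition_id * 4, partition_id * 4 + 4))


def partial_sum(values):
    """Reduce one partition to a single number."""
    return sum(values)


# Build the task graph: with Dask installed, nothing is executed yet.
partials = [delayed(partial_sum)(delayed(load)(p)) for p in range(3)]
total = delayed(sum)(partials)

# compute() passes the graph to a scheduler, which may run independent tasks in parallel.
(result,) = compute(total)
print(result)  # 66
```

The same graph can run on a laptop's local scheduler or on a cluster scheduler, which is the sense in which Dask scales both down and out.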
Dask scheduling
https://docs.dask.org/en/latest/scheduling.html
https://docs.dask.org/en/latest/cheatsheet.html
https://www.tensorflow.org/guide/distributed_training
https://pytorch.org/tutorials/intermediate/dist_tuto.html
Apache Spark (PySpark): see Module 3
https://spark.apache.org/docs/latest/api/python/index.html
Apache Hadoop: see Module 3
https://hadoop.apache.org
Summary of Module 1