HPC Concepts in Data Science

This document discusses high performance computing (HPC) concepts in data science. It introduces parallel programming tools for HPC, including OpenMP, MPI, and CUDA, along with libraries used from Python such as Numba, Dask, cuDNN, TensorFlow, and PyTorch. These tools allow data science workloads to scale across multiple processors and GPUs through parallel programming, just-in-time compilation, and distributed computing. The document also summarizes the key topics covered in Module 1 on HPC concepts for data science.

Uploaded by

vibhuti rajpal

CDS HPC in Data Science

HPC concepts in
Data Science

Module 1 Sashikumaar Ganesan, CDS, IISc Bangalore

HPC System
[Figure: HPC system diagram; only the label "MPI" is legible in the extracted text]

Parallel Programming Tools


• OpenMP (SMP)
  • shared-memory directives
  • used to define work decomposition, NOT data decomposition
  • synchronisation is implicit (can also be user-defined)

• MPI (Message Passing Interface)
  • user specifies how work and data are distributed
  • user specifies how and when communication has to be done, by calling MPI communication library routines

• CUDA
  • general-purpose parallel computing platform and programming model for NVIDIA GPUs
  • allows developers to use C++ as a high-level programming language

• OpenCL/OpenACC/AMD ROCm …
• Python is powerful, but plain Python code does not scale well by itself
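
The MPI bullets above leave the data distribution to the user. A minimal sketch of one common choice, a contiguous block decomposition, is given below in plain Python. No MPI calls are made here; the helper `block_partition` and its parameter names are hypothetical, and in a real program `rank` and `size` would come from the MPI communicator (for example `comm.Get_rank()` and `comm.Get_size()` in mpi4py).

```python
def block_partition(n_items, size, rank):
    """Return (start, count) of the contiguous block owned by `rank`.

    n_items: total number of data items to distribute
    size:    number of processes (would be the communicator size in MPI)
    rank:    index of this process, 0 <= rank < size
    """
    base, extra = divmod(n_items, size)
    # The first `extra` ranks each take one additional item,
    # so block sizes differ by at most one.
    count = base + (1 if rank < extra else 0)
    start = rank * base + min(rank, extra)
    return start, count


if __name__ == "__main__":
    n_items, size = 10, 4
    for rank in range(size):
        start, count = block_partition(n_items, size, rank)
        print(f"rank {rank}: start={start}, count={count}")
```

Each process would then work only on its own block and exchange boundary or result data through explicit communication calls. In OpenMP, by contrast, the data stay in one shared array and only the loop iterations are divided among threads.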
Parallel Programming Tools/Packages

Numba - not only a JIT compiler

• @jit - just-in-time compiler for Python that works best on code that uses NumPy arrays and functions, and loops

• @vectorize - produces NumPy ufuncs (with all the ufunc methods supported)

• @guvectorize - produces NumPy generalized ufuncs

• @stencil - declare a function as a kernel for a stencil-like operation

• @jitclass - for JIT-aware classes

• @cfunc - declare a function for use as a native callback (to be called from C/C++ etc.)

• @overload - register your own implementation of a function for use in nopython mode

• parallel=True - enable automatic parallelization of the function on SMP (shared-memory) systems

• fastmath=True - enable fast-math behaviour for the function

• Numba can target NVIDIA CUDA and (experimentally) AMD ROC GPUs

https://numba.pydata.org/numba-doc/latest/user/5minguide.html
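
As a hedged illustration of the `parallel = True` and `fastmath = True` options listed above, the sketch below compiles a simple reduction loop with Numba's `njit` decorator and `prange`. The function name `sum_of_squares` is hypothetical. The `try`/`except` fallback is an assumption added only so that the sketch still runs (as plain Python, without any speedup) where Numba is not installed; it is not part of Numba.

```python
try:
    from numba import njit, prange  # Numba's nopython JIT and parallel range
except ImportError:
    # Assumption: fallback shim so the sketch runs without Numba.
    # It simply returns the undecorated function (no compilation).
    prange = range

    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]          # used as a bare @njit
        return lambda func: func    # used as @njit(option=...)


@njit(parallel=True, fastmath=True)
def sum_of_squares(n):
    """Sum i*i for i in [0, n) using a parallel reduction loop."""
    total = 0.0
    # prange tells Numba that iterations are independent; the
    # accumulation into `total` is treated as a reduction.
    for i in prange(n):
        total += i * i
    return total


if __name__ == "__main__":
    # The first call triggers compilation; later calls reuse the compiled code.
    print(sum_of_squares(1_000))
```

With Numba available, the loop is compiled to machine code and its iterations are spread over the available cores; the first call includes the compilation time, so any timing comparison should use a second call.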

cuDNN

https://developer.nvidia.com/cudnn
Dask
• provides ways to scale Pandas, Scikit-Learn, and NumPy workflows more natively, with minimal rewriting
• Dask scales out to clusters
• Dask scales down to single computers
• Dask supports complex applications in most cases, but standard big-data technology such as MapReduce or Spark is used for more complex problems (Module 3)
• a kind of “Hadoop/Spark for Python”

https://docs.dask.org/en/latest/why.html

Dask

https://docs.dask.org/en/latest/scheduling.html

https://docs.dask.org/en/latest/cheatsheet.html

https://www.tensorflow.org/guide/distributed_training

https://pytorch.org/tutorials/intermediate/dist_tuto.html

See Module 3

https://spark.apache.org/docs/latest/api/python/index.html

See Module 3

https://hadoop.apache.org

Summary of Module 1

• Calculus for Machine Learning
• Numerical Computation
• Object-oriented programming (OOP)
• OOP concepts in Data Science
• Numba: Just-In-Time (JIT) compiler for Python
• Parallel Architectures
• Parallel programming with MPI
• Parallel programming with OpenMP
• Accelerated computing using GPU
• Data Science tools that use HPC concepts
  • Numba, Dask, cuDNN, TensorFlow / PyTorch


Thank you for your attention and interesting questions!
