Vahid Amiri
National Workshop on Cloud Computing
Cloud Computing Lab, Amirkabir University
Vahidamiry.ir
Nov 2012
Bio-Informatics and Life Sciences
Computational Electromagnetics and
Electrodynamics
Computational Finance
Weather, Atmospheric, Ocean Modeling
and Space Sciences
2
Computational Fluid Dynamics
Data Mining, Analytics, and Databases
Molecular Dynamics
Numerical Analytics
3
 Cluster
 Grid
 Cloud
4
 Parallel and distributed processing system
 Consists of a collection of interconnected stand-
alone computers
 Appears as a single system to users and
applications
5
6
 Distributed, heterogeneous resources for large
experiments
 Compute and storage resources
 Network of Machines
 Larger number of resources
 Extended all over the world
 Different administrative domains
7
 Investment in infrastructure
 Power and Cooling Management
 Management
 Maintenance
 Complexity
 Cost
8
 Computing as a utility
 Easy to access
▪ Easy Configuration
 Pay-as-you-go
 Flexibility
 Scalability
 No need for infrastructure management
9
 IaaS
 Cloud-Based Cluster
▪ Amazon EC2
▪ GoGrid
▪ IBM
▪ Rackspace
 PaaS
 Amazon Elastic MapReduce
 Google App Engine – MapReduce Service
 SaaS
10
11
12
 Develops software solutions for applications in
the cloud
 CloudBroker
 Cyclone
 Plura Processing
 Penguin on Demand
13
 Supporting technical domains including:
 Computational fluid dynamics (CFD)
 Finite element analysis
 Computational chemistry and materials
 Computational biology
14
 Performance penalties
 Users voluntarily lose almost all control over the
execution environment
 Virtualization Technology
▪ Related to the performance loss introduced by the
virtualization mechanism
 Cloud Environment
▪ Due to overheads and the sharing of computing and
communication resources
15
 IaaS HPC
 MPI Cluster
 MapReduce Cluster
 GPU Cluster!!!
 …..
16
17
 General-Purpose computation using GPUs
 Data-parallel algorithms leverage GPU
attributes
 Using graphics hardware for non-graphics computations
 Can improve performance by orders of
magnitude in certain types of
applications
18
 GPUs contain a much larger number of dedicated
ALUs than CPUs.
 GPUs also contain extensive support for the Stream
Processing paradigm, which is related to SIMD
(Single Instruction, Multiple Data) processing.
 Each processing unit on the GPU contains local
memory that improves data manipulation and
reduces fetch time.
19
20
21
 Multiprocessor (MP) = thread processor = ALU
22
 The GPU is viewed as a compute device that:
 Is a coprocessor to the CPU or host
 Has its own DRAM (device memory)
 Runs many threads in parallel
 Data-parallel portions of an application are
executed on the device as kernels which run in
parallel on many threads
 Differences between GPU and CPU threads
 GPU threads are extremely lightweight
▪ Very little creation overhead
 GPU needs 1000s of threads for full efficiency
▪ Multi-core CPU needs only a few
23
 Host: the CPU and its memory (host memory)
 Device: the GPU and its memory (device memory)
24
 CUDA is a set of development tools for creating applications that
execute on the GPU (Graphics Processing Unit).
 The API is an extension to the ANSI C programming language
 Low learning curve
 CUDA was developed by NVIDIA and as such can only run on
NVIDIA GPUs of the G8x series and up.
 CUDA was released on February 15, 2007 for PC, with a beta version
for Mac OS X on August 19, 2008.
25
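As a minimal sketch (not from the slides), a CUDA program defines a kernel with `__global__` and launches it from the host; note that the device-side `printf` used here requires a Fermi-class GPU (compute capability 2.0 or later), newer than the G8x series mentioned above:

```cuda
#include <cstdio>

// Minimal kernel: each thread prints its own thread index.
__global__ void hello() {
    printf("Hello from thread %d\n", threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();        // launch 1 block of 4 threads
    cudaDeviceSynchronize();  // wait for the kernel to finish
    return 0;
}
```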
26
 A kernel is executed as a grid of thread
blocks
 A thread block is a batch of threads that
can cooperate with each other by:
 Synchronizing their execution
 Efficiently sharing data through a
low-latency shared memory
[Figure: the host launches Kernel 1 and Kernel 2 on the device; each kernel executes as a grid of blocks (Grid 1: Block (0,0)–(2,1)), and each block, e.g. Block (1,1), contains threads (0,0)–(4,2)]
27
 Threads and blocks have IDs
 So each thread can decide
what data to work on
 Simplifies memory
addressing when processing
multidimensional data
[Figure: device grid of blocks (0,0)–(2,1); Block (1,1) expanded into its threads (0,0)–(4,2), illustrating block and thread IDs]
28
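The ID arithmetic implied above can be sketched with a small (hypothetical) kernel: each thread combines its block ID and thread ID into one global index into the data.

```cuda
// Each thread computes one global index and processes one element.
__global__ void scale(float *data, int n) {
    // blockIdx.x selects the block, threadIdx.x the thread within it.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: the grid may cover more than n elements
        data[i] *= 2.0f;
}
```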
 Parallel computations are arranged as
grids
 One grid executes after another
 Blocks are assigned to SMs: a single block
runs entirely on one SM, and multiple
blocks can be assigned to the same SM
 A block consists of elements (threads)
29
30
31
 Demo!
32
 CUDA device driver
 CUDA Software Development Kit
 CUDAToolkit
 You (probably) need experience with C or C++
33
 Thread block – an array of concurrent threads
that execute the same program and can
cooperate to compute the result
 A thread ID has corresponding 1-, 2-, or 3-D
indices
 Threads of a thread block share memory
34
 Each thread can:
 R/W per-thread registers
 R/W per-thread local memory
 R/W per-block shared memory
 R/W per-grid global memory
 Read only per-grid constant memory
 Read only per-grid texture memory
 The host can R/W global,
constant, and texture
memories
[Figure: CUDA memory model – each thread has registers and local memory; each block has shared memory; the grid has global, constant, and texture memory, which the host can also access]
35
 cudaMalloc()
 Allocates object in the device Global Memory
 Requires two parameters
▪ Address of a pointer to the allocated object
▪ Size of allocated object
 cudaFree()
 Frees object from device Global Memory
const int BLOCK_SIZE = 64;
float *d_f;
int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);
cudaMalloc((void**)&d_f, size);
cudaFree(d_f);
36
 cudaMemcpy()
 memory data transfer
 Requires four parameters
▪ Pointer to source
▪ Pointer to destination
▪ Number of bytes copied
▪ Type of transfer
▪ Host to Host
▪ Host to Device
▪ Device to Host
▪ Device to Device
cudaMemcpy(d_f, f, size, cudaMemcpyHostToDevice);
cudaMemcpy(f, d_f, size, cudaMemcpyDeviceToHost);
[Figure: CUDA memory model (as above), showing host transfers to and from device global memory]
37
 __global__ defines a kernel function
 Must return void
                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void KernelFunc()      device             host
__host__   float HostFunc()       host               host
38
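The three qualifiers can be sketched together in one (illustrative) fragment:

```cuda
// __device__: runs on the device, callable only from device code.
__device__ float square(float x) { return x * x; }

// __global__: a kernel, launched from the host, must return void.
__global__ void squareAll(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);   // device-to-device call
}

// __host__ (the default): ordinary CPU code, callable from the host.
__host__ float squareOnHost(float x) { return x * x; }
```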
 Allocate the memory on the GPU
 Copy the arrays ‘a’ and ‘b’ to the GPU
 Call the kernel function
 Copy the array ‘c’ back from the GPU to the CPU
 Free the memory allocated on the GPU
39
 Step 1: Allocate the memory on the GPU
int a[N], b[N], c[N];
int *d_a, *d_b, *d_c;
cudaMalloc( (void**)&d_a, N * sizeof(int) );
cudaMalloc( (void**)&d_b, N * sizeof(int) );
cudaMalloc( (void**)&d_c, N * sizeof(int) );
40
 Step 2: Copy the arrays ‘a’ and ‘b’ to the GPU
cudaMemcpy(d_a, a, N * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, N * sizeof(int),
cudaMemcpyHostToDevice);
 Step 3: Call the kernel function
add<<<N,1>>>(d_a, d_b, d_c);
41
 Step 4: Copy the array ‘c’ back from the GPU to the
CPU
cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
 Step 5: Free the memory allocated on the GPU
cudaFree( d_a );
cudaFree( d_b );
cudaFree( d_c );
42
 kernel function
__global__ void add( int *a, int *b, int *c ) {
int tid = blockIdx.x;
if (tid < N)
c[tid] = a[tid] + b[tid];
}
43
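Putting the five steps together, a complete version of the vector-add example might look like the following (assembled from the snippets above; the value of N is chosen arbitrarily):

```cuda
#include <cstdio>
#define N 512

// One block per element: blockIdx.x is the element index.
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main() {
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void**)&d_a, N * sizeof(int));                    // Step 1
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));
    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);  // Step 2
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    add<<<N, 1>>>(d_a, d_b, d_c);                                 // Step 3
    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);  // Step 4
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);                  // Step 5
    printf("c[10] = %d\n", c[10]);  // expected: 10 + 20 = 30
    return 0;
}
```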
 We’ve seen parallel vector addition using:
 Several blocks with one thread each
 One block with several threads
44
45
46
[Figure: CUDA memory model (repeated from earlier)]
 P = M * N, each of size WIDTH x WIDTH
 One thread handles one element of P
 M and N are loaded WIDTH times from
global memory
[Figure: matrices M, N, and P, each WIDTH x WIDTH]
47
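A naive kernel for this scheme might look like the following sketch (names are illustrative; `N_` avoids clashing with a size macro). Each thread reads a full row of M and a full column of N from global memory, which is why M and N are loaded WIDTH times:

```cuda
// One thread computes one element P[row][col].
__global__ void matMul(float *M, float *N_, float *P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; k++)          // dot product of row and column
            sum += M[row * width + k] * N_[k * width + col];
        P[row * width + col] = sum;
    }
}
```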
 Memory latency can be hidden by keeping a
large number of threads busy
 Keep number of threads per block (block size)
and number of blocks per grid (grid size) as
large as possible
 Constant memory can be used for constant
data (variables that do not change).
 Constant memory is cached.
48
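Constant memory is declared with `__constant__` and filled from the host with `cudaMemcpyToSymbol()`; reads are then served from the constant cache. A sketch (the kernel and names are illustrative):

```cuda
// Coefficients that do not change during the kernel run.
__constant__ float coeff[16];

// Host side, before launching:
//   float h_coeff[16] = { ... };
//   cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));

__global__ void applyCoeff(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= coeff[i % 16];  // read hits the cached constant memory
}
```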
49
 Each thread within the
block computes one
element of Csub
50
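A shared-memory version of the matrix multiply might be sketched as follows (assuming, for simplicity, that WIDTH is a multiple of the tile size): each block stages TILE x TILE tiles of M and N through shared memory, so each element is read from global memory only WIDTH/TILE times instead of WIDTH times.

```cuda
#define TILE 16

// Each block computes one TILE x TILE sub-matrix Csub of P.
__global__ void matMulTiled(float *M, float *N_, float *P, int width) {
    __shared__ float Ms[TILE][TILE];
    __shared__ float Ns[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < width / TILE; t++) {
        // Each thread loads one element of each tile into shared memory.
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * TILE + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N_[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();                  // wait until the tile is fully loaded
        for (int k = 0; k < TILE; k++)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                  // wait before overwriting the tile
    }
    P[row * width + col] = sum;
}
```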
51
[Figure: CUDA memory model (repeated), highlighting per-block shared memory]
 Recall that the “stream processors” of the
GPU are organized as MPs (multi-processors)
and every MP has its own set of resources:
 Registers
 Local memory
 The block size needs to be chosen such that
there are enough resources in an MP to
execute a block at a time.
52
 Critical for performance
 Recommended value is 192 or 256
 Maximum value is 512
 Limited by number of registers on the MP
53
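Given a chosen block size, the grid size is typically computed by rounding up, so the grid covers all n elements even when n is not a multiple of the block size (the kernel name and pointer below are hypothetical):

```cuda
__global__ void kernel(float *data, int n);  // some kernel (hypothetical)

void launch(float *d_data, int n) {
    int blockSize = 256;                              // threads per block
    int gridSize  = (n + blockSize - 1) / blockSize;  // ceil(n / blockSize)
    kernel<<<gridSize, blockSize>>>(d_data, n);
}
```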
 Run with different block sizes!
[Figure: matrices M, N, and P, each WIDTH x WIDTH]
54
55
[Chart: Test 1 – execution time vs. data size (S 128 to S 4096), comparing the block-16 kernel with a shared-memory version]
56
[Chart: execution time vs. data size (S 128 to S 4096) for block sizes 16–512, with and without shared memory]
57
Ad

More Related Content

What's hot (20)

Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
Iraklis Psaroudakis
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big Data
WANdisco Plc
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
Milind Bhandarkar
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
Rajesh Nadipalli
 
Dynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonDynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and Comparison
Grisha Weintraub
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
Alluxio, Inc.
 
Hadoop
HadoopHadoop
Hadoop
ronit gaikwad
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
Zohar Elkayam
 
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
DataStax
 
Payment Gateway Live hadoop project
Payment Gateway Live hadoop projectPayment Gateway Live hadoop project
Payment Gateway Live hadoop project
Kamal A
 
Gcp data engineer
Gcp data engineerGcp data engineer
Gcp data engineer
Narendranath Reddy T
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
Microsoft TechNet - Belgium and Luxembourg
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
Allen Day, PhD
 
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
Ontico
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
Keeyong Han
 
Big data ppt
Big data pptBig data ppt
Big data ppt
Thirunavukkarasu Ps
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructure
datastack
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
DataStax
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
DataStax Academy
 
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
DataStax
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big Data
WANdisco Plc
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
Milind Bhandarkar
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
Rajesh Nadipalli
 
Dynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonDynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and Comparison
Grisha Weintraub
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
Alluxio, Inc.
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
Zohar Elkayam
 
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
DataStax
 
Payment Gateway Live hadoop project
Payment Gateway Live hadoop projectPayment Gateway Live hadoop project
Payment Gateway Live hadoop project
Kamal A
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
Allen Day, PhD
 
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
Ontico
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
Keeyong Han
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructure
datastack
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
DataStax
 
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
DataStax
 

Similar to Gpu computing workshop (20)

Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
Martin Peniak
 
GPU Computing with CUDA
GPU Computing with CUDAGPU Computing with CUDA
GPU Computing with CUDA
PriyankaSaini94
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
AnirudhGarg35
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
ceyifo9332
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
ARUNACHALAM468781
 
CUDA
CUDACUDA
CUDA
Rachel Miller
 
Linux Conference Australia 2018 : Device Tree, past, present, future
Linux Conference Australia 2018 : Device Tree, past, present, futureLinux Conference Australia 2018 : Device Tree, past, present, future
Linux Conference Australia 2018 : Device Tree, past, present, future
Neil Armstrong
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
Rob Gillen
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
Dilum Bandara
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Cuda 2011
Cuda 2011Cuda 2011
Cuda 2011
coolmirza143
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
ssuser413a98
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Cuda introduction
Cuda introductionCuda introduction
Cuda introduction
Hanibei
 
Lec04 gpu architecture
Lec04 gpu architectureLec04 gpu architecture
Lec04 gpu architecture
Taras Zakharchenko
 
GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
Jax Jargalsaikhan
 
L05 parallel
L05 parallelL05 parallel
L05 parallel
MEPCO Schlenk Engineering College
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
Dhaval Kaneria
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
Martin Peniak
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
AnirudhGarg35
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
ceyifo9332
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
ARUNACHALAM468781
 
Linux Conference Australia 2018 : Device Tree, past, present, future
Linux Conference Australia 2018 : Device Tree, past, present, futureLinux Conference Australia 2018 : Device Tree, past, present, future
Linux Conference Australia 2018 : Device Tree, past, present, future
Neil Armstrong
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
Rob Gillen
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
Dilum Bandara
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
ssuser413a98
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Cuda introduction
Cuda introductionCuda introduction
Cuda introduction
Hanibei
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
Dhaval Kaneria
 
Ad

Recently uploaded (20)

Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136
illuminati Agent uganda call+256776963507/0741506136
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Ad

Gpu computing workshop

  • 1. Vahid Amiri NationalWorkshop of Cloud Computing Cloud Computing Lab- Amirkabir University Vahidamiry.ir Nov 2012
  • 2. Bio-Informatics and Life Sciences Computational Electromagnetics and Electrodynamics Computational Finance Weather, Atmospheric, Ocean Modeling and Space Sciences 2
  • 3. Computational Fluid Dynamics Data Mining, Analytics, and Databases Molecular Dynamics Numerical Analytics 3
  • 5.  Parallel and distributed processing system  Consists of a collection of interconnected stand- alone computers  Appears as a single system to users and applications 5
  • 6. 6
  • 7.  Distributed, heterogeneous resources for large experiments  Compute and storage resources  Network of Machines  Larger number of resources  Extended all over the world  Different administrative domains 7
  • 8.  Investment in infrastructure  Power and Cooling Management  Management  Maintenance  Complexity  cost 8
  • 9.  Computing as a utility  Easy to access ▪ Easy Configuration  Pay-as-you-go  Flexibility  Scalability  No need to infrastructure management 9
  • 10.  IaaS  Cloud-Based Cluster ▪ Amazon EC2 ▪ GoGrid ▪ IBM ▪ Rackspace  PaaS  Amazon Elastic MapReduce  Google App Engine – MapReduce Service  SaaS 10
  • 11. 11
  • 12. 12
  • 13.  Develops software solutions for applications in the cloud  CloudBroker  Cyclone  Plura Processing  Penguin on Demand 13
  • 14.  Supporting five technical domains:  Computational fluid dynamics (CFD)  Finite element analysis  Computational chemistry and materials  Computational biology 14
  • 15.  Performance penalties  Users voluntarily lose almost all control over the execution environment  Virtualization Technology ▪ These are related to the performance loss introduced by the virtualization mechanism  Cloud Environment ▪ Due to overheads and to the sharing of computing and communication resources 15
  • 16.  IaaS HPC  MPI Cluster  MapReduce Cluster  GPU Cluster!!!  ….. 16
  • 17. 17
  • 18.  General Purpose computation using GPU  Data parallel algorithms leverage GPU attributes  Using graphic hardware for non-graphic computations  Can improve the performance in the orders of magnitude in certain types of applications 18
  • 19.  GPUs contain a much larger number of dedicated ALUs than CPUs.  GPUs also contain extensive support for the Stream Processing paradigm. It is related to SIMD (Single Instruction Multiple Data) processing.  Each processing unit on a GPU contains local memory that improves data manipulation and reduces fetch time. 19
  • 20. 20
  • 21. 21
  • 22.  Multiprocessor (MP) = thread processor = ALU 22
  • 23.  The GPU is viewed as a compute device that:  Is a coprocessor to the CPU or host  Has its own DRAM (device memory)  Runs many threads in parallel  Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads  Differences between GPU and CPU threads  GPU threads are extremely lightweight ▪ Very little creation overhead  GPU needs 1000s of threads for full efficiency ▪ Multi-core CPU needs only a few 23
  • 24.  Host The CPU and its memory (host memory)  Device The GPU and its memory (device memory) 24
  • 25.  CUDA is a set of development tools for creating applications that execute on the GPU (Graphics Processing Unit).  The API is an extension to the ANSI C programming language  Low learning curve  CUDA was developed by NVidia and as such can only run on NVidia GPUs of the G8x series and up.  CUDA was released on February 15, 2007 for PC, and a Beta version for Mac OS X on August 19, 2008. 25
  • 26. 26
  • 27.  A kernel is executed as a grid of thread blocks  A thread block is a batch of threads that can cooperate with each other by:  Synchronizing their execution  Efficiently sharing data through a low-latency shared memory [Figure: the host launches Kernel 1 and Kernel 2; each kernel executes as a grid of blocks, each block an array of threads] 27
  • 28.  Threads and blocks have IDs  So each thread can decide what data to work on  Simplifies memory addressing when processing multidimensional data [Figure: a grid of blocks, each block a 2-D array of threads identified by their indices] 28
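The ID scheme above can be sketched in CUDA C; the kernel name and array are illustrative:

```cuda
// Each thread combines its block ID and thread ID into a unique global
// index, then uses that index to pick the element it works on.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global 1-D index
    if (i < n)             // guard threads that fall past the end of the array
        data[i] *= 2.0f;
}
```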
  • 29.  Parallel computations arranged as grids  One grid executes after another  Blocks are assigned to SMs: a single block runs entirely on a single SM, and multiple blocks can be assigned to the same SM.  A block consists of elements (threads) 29
  • 30. 30
  • 31. 31
  • 33.  CUDA device driver  CUDA Software Development Kit  CUDA Toolkit  You (probably) need experience with C or C++ 33
  • 34.  Thread block – an array of concurrent threads that execute the same program and can cooperate to compute the result  A thread ID has corresponding 1-, 2-, or 3-D indices  Threads of a thread block share memory 34
  • 35.  Each thread can:  R/W per-thread registers  R/W per-thread local memory  R/W per-block shared memory  R/W per-grid global memory  Read only per-grid constant memory  Read only per-grid texture memory  The host can R/W global, constant, and texture memories [Figure: CUDA memory model showing registers, local, shared, global, constant, and texture memory] 35
  • 36.  cudaMalloc()  Allocates object in the device Global Memory  Requires two parameters ▪ Address of a pointer to the allocated object ▪ Size of allocated object  cudaFree()  Frees object from device Global Memory const int BLOCK_SIZE = 64; float *d_f; int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float); cudaMalloc((void**)&d_f, size); cudaFree(d_f); 36
  • 37.  cudaMemcpy()  memory data transfer  Requires four parameters ▪ Pointer to destination ▪ Pointer to source ▪ Number of bytes copied ▪ Type of transfer ▪ Host to Host ▪ Host to Device ▪ Device to Host ▪ Device to Device cudaMemcpy(d_f, f, size, cudaMemcpyHostToDevice); cudaMemcpy(f, d_f, size, cudaMemcpyDeviceToHost); 37
  • 38.  __global__ defines a kernel function  Must return void  __device__ float DeviceFunc(): executed on the device, callable only from the device  __global__ void KernelFunc(): executed on the device, callable only from the host  __host__ float HostFunc(): executed on the host, callable only from the host 38
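A minimal sketch of the three qualifiers in use; the function names are illustrative:

```cuda
// __device__: runs on the GPU, callable only from other GPU code
__device__ float square(float x) { return x * x; }

// __global__: a kernel; runs on the GPU, launched from the host, returns void
__global__ void squareAll(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);   // calls the __device__ helper
}

// __host__ (the default): ordinary CPU code, callable only from the host
__host__ float squareOnHost(float x) { return x * x; }
```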
  • 39.  Allocate the memory on the GPU  Copy the arrays ‘a’ and ‘b’ to the GPU  Call the kernel function  Copy the array ‘c’ back from the GPU to the CPU  Free the memory allocated on the GPU 39
  • 40.  Step 1: Allocate the memory on the GPU int a[N], b[N], c[N]; int *d_a, *d_b, *d_c; cudaMalloc( (void**)&d_a, N * sizeof(int) ); cudaMalloc( (void**)&d_b, N * sizeof(int) ); cudaMalloc( (void**)&d_c, N * sizeof(int) ); 40
  • 41.  Step 2: Copy the arrays ‘a’ and ‘b’ to the GPU cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);  Step 3: Call the kernel function add<<<N,1>>>(d_a, d_b, d_c); 41
  • 42.  Step 4: Copy the array ‘c’ back from the GPU to the CPU cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);  Step 5: Free the memory allocated on the GPU cudaFree( d_a ); cudaFree( d_b ); cudaFree( d_c ); 42
  • 43.  kernel function __global__ void add( int *a, int *b, int *c ) { int tid = blockIdx.x; if (tid < N) c[tid] = a[tid] + b[tid]; } 43
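Putting slides 40–43 together, a complete vector-addition program might look like this sketch (the value of N is chosen arbitrarily; error checking is omitted for brevity):

```cuda
#include <stdio.h>

#define N 512

// Kernel from slide 43: one block per element, one thread per block
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main(void) {
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;

    // Step 1: allocate device memory
    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // Step 2: copy the inputs to the device
    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // Step 3: launch the kernel with N blocks of 1 thread each
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Step 4: copy the result back to the host
    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    printf("c[1] = %d\n", c[1]);

    // Step 5: free device memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```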
  • 44.  We’ve seen parallel vector addition using:  Several blocks with one thread each  One block with several threads 44
  • 45. 45
  • 46. [Figure: CUDA memory model showing registers, local, shared, global, constant, and texture memory] 46
  • 47.  P = M * N of size WIDTH x WIDTH  One thread handles one element of P  M and N are loaded WIDTH times from global memory [Figure: matrices M, N, and P, each WIDTH x WIDTH] 47
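The one-thread-per-element scheme above can be sketched as a naive kernel; the kernel name is illustrative, and square matrices are assumed:

```cuda
// Naive matrix multiply P = M * N: one thread per element of P.
// Each thread walks a full row of M and column of N, so every element
// of M and N is fetched from global memory WIDTH times overall.
__global__ void matMulNaive(const float *M, const float *N,
                            float *P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; k++)
            sum += M[row * width + k] * N[k * width + col];
        P[row * width + col] = sum;
    }
}
```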
  • 48.  Memory latency can be hidden by keeping a large number of threads busy  Keep number of threads per block (block size) and number of blocks per grid (grid size) as large as possible  Constant memory can be used for constant data (variables that do not change).  Constant memory is cached. 48
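The constant-memory tip above can be sketched as follows; the coefficient array and kernel name are illustrative:

```cuda
// Constant memory: read-only from kernels, cached, written by the host.
__constant__ float coeffs[16];

__global__ void applyCoeffs(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= coeffs[i % 16];   // reads are served from the constant cache
}

// Host side: copy data into the constant symbol before launching, e.g.
//   cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));
```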
  • 49. 49
  • 50.  Each thread within the block computes one element of Csub 50
  • 51. [Figure: CUDA memory model showing registers, local, shared, global, constant, and texture memory] 51
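The shared-memory scheme of slides 49–51 can be sketched as a tiled kernel; TILE is an illustrative choice, and for simplicity the sketch assumes WIDTH is a multiple of TILE and the grid exactly covers the matrix:

```cuda
#define TILE 16

// Tiled matrix multiply P = M * N: each block computes one TILE x TILE
// sub-matrix Csub. Tiles of M and N are staged in shared memory, so each
// global element is read WIDTH/TILE times instead of WIDTH times.
__global__ void matMulTiled(const float *M, const float *N,
                            float *P, int width) {
    __shared__ float Ms[TILE][TILE];
    __shared__ float Ns[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < width / TILE; t++) {
        // Each thread loads one element of the current M tile and N tile
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * TILE + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();                   // wait until the tile is fully loaded

        for (int k = 0; k < TILE; k++)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                   // wait before overwriting the tile
    }
    P[row * width + col] = sum;
}
```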
  • 52.  Recall that the “stream processors” of the GPU are organized as MPs (multi-processors) and every MP has its own set of resources:  Registers  Local memory  The block size needs to be chosen such that there are enough resources in an MP to execute a block at a time. 52
  • 53.  Critical for performance  Recommended value is 192 or 256  Maximum value is 512  Limited by number of registers on the MP 53
  • 54.  Run with different block sizes! [Figure: matrices M, N, and P, each WIDTH x WIDTH] 54
  • 55. [Chart: execution times for matrix sizes 128–4096 with block size 16, global memory vs shared memory] 55
  • 56. [Chart: execution times for matrix sizes 128–4096 with block sizes 16–512, global memory vs shared memory] 56
  • 57. 57