GPU Architecture Ebook
Programming
Serial execution: instructions are executed one after another on a single Central Processing Unit (CPU).
Problems:
• More expensive to produce
• More expensive to run
• Bus speed limitation
Parallel Computing
Official-sounding definition: the simultaneous use of multiple compute resources to solve a computational problem.
Benefits:
• Economical – requires less power and is cheaper to produce
• Better performance – avoids the bus/bottleneck issue
Limitations:
• New architecture – Von Neumann is all we know!
• New debugging difficulties – e.g., cache consistency issues
Processes and Threads
• Traditional process
– One thread of control through a large, potentially sparse address space
– Address space may be shared with other processes (shared memory)
– Collection of system resources (files, semaphores)
• Thread (lightweight process)
– A flow of control through an address space
– Each address space can have multiple concurrent control flows
– Each thread has access to the entire address space
– Potentially parallel execution, minimal state (low overheads)
– May need synchronization to control access to shared variables
Threads
• Each thread has its own stack, PC, registers
– Share address space, files,…
Flynn’s Taxonomy
Classification of computer architectures, proposed by Michael J. Flynn
• SISD – traditional serial architecture found in most computers.
• SIMD – parallel computer. One instruction is executed many times with different data (think of a for loop indexing through an array).
• MISD – multiple processing units operate on a single data stream via independent instruction streams. Not really used in practice.
• MIMD – each processing unit runs its own instruction stream on its own data. Fully parallel and the most common form of parallel computing.
What is GPGPU ?
• General Purpose computation using GPU
in applications other than 3D graphics
– GPU accelerates critical path of application
• Data parallel algorithms leverage GPU attributes
– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Low-latency floating point (FP) computation
• Applications
– Game effects (FX) physics, image processing
– Physical modeling, computational engineering, matrix algebra,
convolution, correlation, sorting
• GPU – graphics processing unit
(CPU: ~150 GFLOPS vs. GPU: ~1.3 TFLOPS)
CUDA Processor Terminology
• SPA
– Streaming Processor Array (variable across GeForce 8-series, 8 in
GeForce8800)
• TPC
– Texture Processor Cluster (2 SM + TEX)
• SM
– Streaming Multiprocessor (8 SP)
– Multi-threaded processor core
– Fundamental processing unit for CUDA thread block
• SP
– Streaming Processor
– Scalar ALU for a single CUDA thread
Streaming Multiprocessor (SM)
• Streaming Multiprocessor (SM)
– 8 Streaming Processors (SP), each a fully pipelined scalar processor
– 2 Special Function Units (SFU)
– 16 KB shared memory
– Instruction L1 and data L1 caches
GPU computing
GPU: Graphics Processing Unit
Traditionally used for real-time rendering
High computational density and memory bandwidth
Throughput processor: 1000s of concurrent threads to hide latency
SCC CPU: 2.6 x 4 x 16 = 166.4 Gigaflops double precision
SCC GPU: 1.66 Teraflops double precision
A kernel launch creates a grid of thread blocks (e.g., Kernel 2 executes as Grid 2, which contains blocks such as Block (1, 1)).
CUDA - Memory Units Description
• Registers:
o Fastest.
o Only accessible by a thread.
o Lifetime of a thread.
• Shared memory:
o Can be as fast as registers if there are no bank conflicts or when all threads read from the same address.
o Accessible by any thread within the block where it was created.
o Lifetime of a block.
CUDA - Memory Units Description (continued)
• Global Memory:
o Up to 150x slower than registers or shared memory.
o Accessible from either host or device.
o Lifetime of the application.
• Local Memory:
o Resides in global memory. Can be 150x slower than registers and shared memory.
o Accessible only by a thread.
o Lifetime of a thread.
NVCC compiler
• Compiles C or PTX code (PTX is the CUDA instruction set architecture)
• Compiles to either PTX code or binary (a cubin object)
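As an illustration, a typical compile might look like the following (the file and output names are just examples):

nvcc -o vectorAdd vectorAdd.cu      # compile a CUDA source file into an executable
nvcc -ptx vectorAdd.cu -o vectorAdd.ptx   # stop at PTX to inspect the intermediate code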
Development: Basic Idea
1. Allocate memory of equal size on both the host and the device
2. Transfer data from host to device
3. Execute the kernel to compute on the data
4. Transfer the results back to the host
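A minimal sketch of these four steps, using the increment_gpu kernel that appears later in this chapter (the array size, block size, and the added constant are illustrative choices):

#include <stdio.h>
#include <stdlib.h>

/* kernel shown later in this chapter: adds b to every element of a */
__global__ void increment_gpu(float *a, float b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] + b;
}

int main(void) {
    int N = 1024;
    size_t nbytes = N * sizeof(float);

    /* 1. Allocate memory of equal size on host and device */
    float *h_a = (float *)malloc(nbytes);
    for (int i = 0; i < N; i++) h_a[i] = 0.0f;
    float *d_a = 0;
    cudaMalloc((void **)&d_a, nbytes);

    /* 2. Transfer data from host to device */
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);

    /* 3. Execute the kernel to compute on the data */
    int blocksize = 256;
    increment_gpu<<<(N + blocksize - 1) / blocksize, blocksize>>>(d_a, 1.0f, N);

    /* 4. Transfer the results back to the host */
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);

    printf("h_a[0] = %f\n", h_a[0]);   /* expect 1.0 */

    cudaFree(d_a);
    free(h_a);
    return 0;
}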
Kernel Function Qualifiers
• __device__ – executed on the device, callable only from the device
• __global__ – a kernel: executed on the device, callable from the host
• __host__ – executed on the host, callable only from the host (the default)
Example in C:
CPU program
void increment_cpu(float *a, float b, int N)
CUDA program
__global__ void increment_gpu(float *a, float b, int N)
Variable Type Qualifiers
• Specify how a variable is stored in memory
• __device__ – resides in global memory on the device
• __shared__ – resides in the on-chip shared memory of a thread block
• __constant__ – resides in constant memory
Example:
__global__ void increment_gpu(float *a, float b, int N)
{
    extern __shared__ float shared[];   /* size supplied at kernel launch */
}
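A small sketch showing all three qualifiers in use (kernel and variable names are hypothetical, not taken from the course examples):

/* resides in device global memory, visible to all kernels */
__device__ float d_scale = 2.0f;

/* resides in constant memory, read-only inside kernels */
__constant__ float c_offset = 1.0f;

__global__ void scale_and_offset(float *a, int N) {
    /* statically sized per-block shared memory
       (assumes the kernel is launched with at most 256 threads per block) */
    __shared__ float tile[256];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        tile[threadIdx.x] = a[idx];
        a[idx] = tile[threadIdx.x] * d_scale + c_offset;
    }
}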
Calling the Kernel
• Calling a kernel function is quite different from calling a regular function: the <<< >>> execution configuration specifies how many blocks and how many threads per block to launch
void main(){
    int blocks = 256;
    int threadsperblock = 512;
    mycudafunc<<<blocks, threadsperblock>>>(/* some parameter */);
}
CUDA: Hello, World! example
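The full listing is not reproduced in this text; a minimal sketch that would produce output like the one shown below (the kernel name and the 3-block x 8-thread launch are assumptions inferred from the output):

#include <stdio.h>

/* each GPU thread prints its own thread and block index */
__global__ void hello_gpu(void) {
    printf("Hello from GPU: thread %d and block %d\n", threadIdx.x, blockIdx.x);
}

int main(void) {
    printf("Hello Cuda!\n");

    /* launch 3 blocks of 8 threads each */
    hello_gpu<<<3, 8>>>();

    /* wait for the kernel (and its printf output) to finish */
    cudaDeviceSynchronize();

    printf("Welcome back to CPU!\n");
    return 0;
}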
Sample output:
Hello Cuda!
Hello from GPU: thread 0 and block 0
Hello from GPU: thread 1 and block 0
. . .
Hello from GPU: thread 6 and block 2
Hello from GPU: thread 7 and block 2
Welcome back to CPU!
Note: threads are scheduled on a "first come, first served" basis, so no particular ordering of the GPU output lines can be expected.
GPU Memory Allocation / Release
Host (CPU) manages GPU memory:
• cudaMalloc (void ** pointer, size_t nbytes)
• cudaMemset (void * pointer, int value, size_t count);
• cudaFree (void* pointer)
void main(){
    int n = 1024;
    int nbytes = n * sizeof(int);
    int *d_a = 0;
    cudaMalloc( (void**)&d_a, nbytes );
    cudaMemset( d_a, 0, nbytes );
    cudaFree(d_a);
}
Memory Transfer
cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
• blocks the CPU thread: control returns only after the copy is complete
• the copy doesn't start until all previous CUDA calls have completed
enum cudaMemcpyKind
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice
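The device-to-device direction is worth a quick illustration, since it never stages data through host memory (buffer names and sizes are illustrative):

int main(void) {
    int n = 1024;
    size_t nbytes = n * sizeof(float);
    float *d_src = 0, *d_dst = 0;
    cudaMalloc((void **)&d_src, nbytes);
    cudaMalloc((void **)&d_dst, nbytes);
    cudaMemset(d_src, 0, nbytes);

    /* copy between two device buffers without going through the host */
    cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice);

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}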
Host Synchronization
All kernel launches are asynchronous
• control returns to CPU immediately
• kernel starts executing once all previous CUDA calls have completed
Memcopies are synchronous
• control returns to CPU once the copy is complete
• copy starts once all previous CUDA calls have completed
cudaThreadSynchronize()
• blocks until all previous CUDA calls complete
Asynchronous CUDA calls provide:
• non-blocking memcopies
• ability to overlap memcopies and kernel execution
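A small sketch of these rules (the kernel name and sizes are illustrative):

__global__ void do_work(float *a, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * 2.0f;
}

int main(void) {
    int N = 1 << 20;
    float *d_a = 0;
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMemset(d_a, 0, N * sizeof(float));

    /* kernel launch is asynchronous: control returns to the CPU immediately */
    do_work<<<(N + 255) / 256, 256>>>(d_a, N);

    /* block the host until all previously issued CUDA calls have completed */
    cudaThreadSynchronize();   /* cudaDeviceSynchronize() in newer CUDA versions */

    cudaFree(d_a);
    return 0;
}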
The Big Difference

CPU program:

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    increment_cpu(a, b, N);
}

GPU program:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
CUDA: Vector Addition example
/* Main function, executed on host (CPU) */
int main(void) {
    /* ... */
    return(0);
}
Device memory is released at the end of main():
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
The vector elements are distributed across thread blocks (Block #0, Block #1, ...), one element per thread:
/* CUDA Kernel */
__global__ void vectorAdd( const float *A,
                           const float *B,
                           float *C,
                           int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
        C[i] = A[i] + B[i];
}
Full source: intro_gpu/vectorAdd/vectorAdd.cu
Mechanics of Using Shared Memory
• __shared__ type qualifier required
• Must be allocated from a global/device function, or as "extern"
• Examples:

/* a form of dynamic allocation: */
/* MEMSIZE is the size of the per-block shared memory */
extern __shared__ float d_s_array[];

__host__ void outerCompute() {
    compute<<<gs,bs,MEMSIZE>>>();
}

__global__ void compute() {
    d_s_array[i] = …;
}

/* statically sized allocation inside the kernel: */
__global__ void compute2() {
    __shared__ float d_s_array[M];

    /* create or copy from global memory */
    d_s_array[j] = …;

    /* write result back to global memory */
    d_g_array[j] = d_s_array[j];
}
Optimization using Shared Memory
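The original material for this heading is not reproduced here; as one common illustration of the idea, here is a sketch of a block-level sum reduction that stages data in shared memory so that each block performs a single global-memory write (all names and sizes are assumptions):

#define BLOCKSIZE 256

/* each block computes a partial sum of its slice of the input,
   using shared memory to avoid repeated global-memory traffic;
   assumes the kernel is launched with blockDim.x == BLOCKSIZE (a power of two) */
__global__ void partial_sums(const float *in, float *block_sums, int N) {
    __shared__ float cache[BLOCKSIZE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (idx < N) ? in[idx] : 0.0f;
    __syncthreads();

    /* tree reduction within the block, entirely in shared memory */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    /* one global-memory write per block instead of one per thread */
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = cache[0];
}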
10x GPU Computing Growth
• Tesla GPUs: 6,000 (2008) → 450,000 (2015)
• CUDA downloads: 150K (2008) → 3M (2015)
• Supercomputing Teraflops: 77 (2008) → 54,000 (2015)
• University Courses: 60 (2008) → 800 (2015)
• Academic Papers: 4,000 (2008) → 60,000 (2015)
GPU Acceleration
Three ways to accelerate applications:
• GPU-accelerated libraries
• OpenACC directives
• Programming languages
An application is split between the two processors: the compute-intensive functions are parallelized and run on the GPU, while the rest of the sequential code keeps running on the CPU.
Will Execution on a GPU Accelerate My Application?
• C – OpenACC, CUDA
• Python – PyCUDA