COE4590_16_GPU2

The document discusses the architecture of Graphics Processing Units (GPUs), focusing on the organization of threads, blocks, and grids for parallel processing. It explains how threads are lightweight and grouped into blocks, which are further organized into grids, allowing for efficient data processing. The document also illustrates the execution of threads in a multi-threaded operation using an example of counting occurrences of a number in an array.

Distributed and Parallel Systems

(Lecture 16)
Graphics Processing Unit (GPU)
(Lecture II)

By
Abdus Samad
Threads, Blocks, and Grids
 A thread is associated with each data element.
 A GPU supports thousands of lightweight threads, whereas a CPU runs only a few heavyweight ones.
 Threads are organized into blocks.
 A thread block contains many threads; each thread processes its own element(s).
 Blocks are organized into a grid. Blocks are executed independently and in any order.
 Thread management is handled by the GPU hardware, not by applications or the OS.
How Does a Thread Work?
 Example: how many times does 6 appear in an array of 16 elements?
 In single-threaded operation, the 16 numbers are scanned sequentially and each match increments a counter.
 In multi-threaded operation, suppose there are 4 threads.
 Each thread examines 4 elements, with all threads running on one GPU.
 There is one grid consisting of one block.
How Does a Thread Work?
3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6

 Instead of a sequential (block) data distribution, a cyclic data distribution is used here:
 Thread 0 examines array elements 0, 4, 8, 12
 Thread 1 examines array elements 1, 5, 9, 13
 Thread 2 examines array elements 2, 6, 10, 14
 Thread 3 examines array elements 3, 7, 11, 15
Threads, Blocks, and Grids
 Thread: consists of 32 elements.
 Block: consists of 16 threads, covering 512 elements.
 Grid: consists of 16 blocks, covering 8192 elements, i.e., all elements of the problem.
 Thread Block:
 Analogous to a strip-mined vector loop with a vector length of 32.
 Breaks the vector down into manageable sets of vector elements.
 32 elements/thread x 16 SIMD threads/block = 512 elements/block
 A SIMD instruction executes 32 elements at a time
 Grid size = 8192 vector elements / 512 elements/block = 16 blocks
Threads, Blocks, and Grids
 Threads are grouped into thread blocks.
 Blocks are grouped into a single grid.
 The grid is executed on the GPU as a kernel.
 A kernel is executed as a grid of thread blocks.
 Threads from the same block cooperate and share a memory space.
Threads, Blocks, and Grids
 A thread block is a batch of threads that can cooperate with each other by:
 synchronizing their execution for hazard-free shared memory accesses;
 efficiently sharing data through a low-latency shared memory.
 Threads from different blocks cannot cooperate or share data.
Block and Thread IDs
 Threads and blocks have IDs:
 Block IDs are either 1D or 2D: Block (x, y)
 Thread IDs may be 1D, 2D, or 3D: Thread (x, y, z)
 This simplifies memory addressing when processing multidimensional data.
 32 elements per thread
 16 threads/block
 512 elements/block
 The grid covers the entire code
 Grid size = 8192 elements
 This requires 8192 / 512 (elements/block) = 16 blocks
 Each block is assigned to a multithreaded processor
Sample GPU SIMT Code (Reductions)
 Example: transform a serial summation into a parallel (tree) reduction.
 The first two operations are done in parallel on different GPU threads.
 The last operation is done serially.
Thanks
