GPU History and CUDA Programming Basics

CUDA Programming Basics
Outline of CUDA Basics
[Figure: system layout — CPU connected through the chipset and PCIe to the GPU; the GPU contains multiple SMs, each with its own shared memory (SMEM), plus off-chip global memory]
Blocks of threads run on an SM

Thread: executes on a streaming processor, with its own registers and per-thread local memory
Threadblock: executes on a streaming multiprocessor; its threads share per-block shared memory (SMEM)
Whole grid runs on GPU: blocks are distributed across the SMs, and every thread can access per-device global memory
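To make the per-block shared memory concrete, here is a minimal kernel sketch, not from the original slides, that stages data through SMEM; the name and the reverse-within-a-block behavior are illustrative assumptions:

__global__ void reverseInBlock(int* d_out, const int* d_in)
{
    // Per-block shared memory: visible only to threads in this block
    __shared__ int smem[256];   // assumes blockDim.x == 256

    int gid = threadIdx.x + blockDim.x * blockIdx.x;
    smem[threadIdx.x] = d_in[gid];    // stage through SMEM
    __syncthreads();                  // wait for the whole block

    // Write back in reversed order within the block
    d_out[gid] = smem[blockDim.x - 1 - threadIdx.x];
}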
Thread Hierarchy

[Figure: kernels launch one after another as sequential kernels — Kernel 0, Kernel 1, … — and all of them read and write the same per-device global memory]
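A sketch of that ordering (the kernel names and d_x are illustrative assumptions): launches issued to the default stream execute one after another, so a later kernel sees an earlier kernel's writes to global memory.

__global__ void scaleBy2(float* x) { x[threadIdx.x] *= 2.0f; }
__global__ void addOne(float* x)   { x[threadIdx.x] += 1.0f; }

// Kernel 0, then Kernel 1: on the default stream these run
// sequentially, communicating through per-device global memory.
scaleBy2<<<1, 256>>>(d_x);
addOne<<<1, 256>>>(d_x);   // observes the doubled values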
Memory Model

[Figure: host memory alongside Device 0 memory and Device 1 memory; data moves between host and devices with cudaMemcpy()]
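As a sketch of the figure's topology: moving data from Device 0 memory to Device 1 memory classically stages through host memory with two cudaMemcpy() calls. The buffer names here are assumptions:

float *h_buf;             // staging buffer in host memory
float *d0_buf, *d1_buf;   // buffers allocated on device 0 and device 1
size_t nbytes = 1024 * sizeof(float);

// ... allocate h_buf with malloc(), and each device buffer with
// cudaMalloc() after selecting its device with cudaSetDevice() ...

cudaSetDevice(0);
cudaMemcpy(h_buf, d0_buf, nbytes, cudaMemcpyDeviceToHost);  // device 0 -> host
cudaSetDevice(1);
cudaMemcpy(d1_buf, h_buf, nbytes, cudaMemcpyHostToDevice);  // host -> device 1

Later runtimes added cudaMemcpyPeer() for direct device-to-device transfers, but the staged copy above matches the figure.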
Example: Vector Addition Kernel
Device Code
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run grid of N/256 blocks of 256 threads each
    // (assumes N is a multiple of 256 and that d_A, d_B, d_C
    // point to device memory allocated elsewhere)
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}
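The slide shows only the fragments above; a self-contained host-side sketch of the same example follows. The initialization values, the i < n guard, and the rounded-up block count are additions for illustration, not part of the original slide.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* A, const float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)                    // guard handles n not divisible by 256
        C[i] = A[i] + B[i];
}

int main()
{
    int n = 1 << 20;
    size_t nbytes = n * sizeof(float);

    float *h_A = (float*)malloc(nbytes);
    float *h_B = (float*)malloc(nbytes);
    float *h_C = (float*)malloc(nbytes);
    for (int i = 0; i < n; i++) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, nbytes);
    cudaMalloc((void**)&d_B, nbytes);
    cudaMalloc((void**)&d_C, nbytes);

    cudaMemcpy(d_A, h_A, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, nbytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // round up
    vecAdd<<<blocks, threads>>>(d_A, d_B, d_C, n);

    cudaMemcpy(h_C, d_C, nbytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", h_C[0]);              // expect 3.0

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}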
Memory Allocation / Release

int n = 1024;
int nbytes = n*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );
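Each of these runtime calls returns a cudaError_t; the slide omits error handling, but a common checking pattern (an addition, using only standard runtime calls) is:

cudaError_t err = cudaMalloc( (void**)&d_a, nbytes );
if (err != cudaSuccess)
{
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}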
Data Copies
int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a = 0, *h_a = 0;   // host/device pointer declarations (lost in extraction)

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    free( h_a );
    cudaFree( d_a );
    return 0;
}
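The walkthrough allocates on both sides, but the copy itself did not survive extraction. A minimal round trip using the same buffers might look like the following; the cudaMemset and the print loop are assumptions:

cudaMemset( d_a, 0, num_bytes );                            // zero the device array
cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );  // blocking copy back
for (int i = 0; i < dimx; i++)
    printf("%d ", h_a[i]);                                  // expect all zeros

cudaMemcpy blocks the calling CPU thread until all bytes have been copied, and does not start copying until previous CUDA calls complete.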
Example: Shuffling Data
int main()
{
    // Run grid of N/256 blocks of 256 threads each
    shuffle<<<N/256, 256>>>(d_old, d_new, d_ind);
}
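The shuffle kernel itself did not survive extraction. A sketch consistent with the launch above, gathering d_old through the index array d_ind, is:

__global__ void shuffle(int* d_old, int* d_new, int* d_ind)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    d_new[i] = d_old[d_ind[i]];   // gather: each thread moves one element
}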
IDs and Dimensions

Threads: 3D IDs, unique within a block
Blocks: 2D IDs, unique within a grid
Built-in variables: threadIdx, blockIdx, blockDim, gridDim
[Figure: device running Grid 1, a grid of blocks, each block a 3D arrangement of threads]

Kernel with 2D Indexing

__global__ void kernel(int* a, int dimx, int dimy)
{
    // Only the final statement survived extraction; the index math
    // below is a reconstruction (assumption) consistent with the
    // dimx*dimy allocation in main() below.
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = iy * dimx + ix;

    a[idx] = a[idx] + 1;
}
int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a = 0, *h_a = 0;   // declarations lost in extraction

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    // The launch did not survive extraction; a 2D launch matching
    // the kernel above would be (assumption):
    dim3 block(16, 16);
    dim3 grid(dimx / block.x, dimy / block.y);
    kernel<<<grid, block>>>(d_a, dimx, dimy);

    free( h_a );
    cudaFree( d_a );
    return 0;
}
Blocks must be independent

Any possible interleaving of blocks should be valid:
presumed to run to completion without pre-emption
can run in any order
can run concurrently OR sequentially
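Blocks may still coordinate, as long as correctness never depends on a particular schedule. Below is a sketch, not from the slides, of blocks claiming work through a shared queue pointer with atomics; the names and the doubling workload are assumptions:

__device__ int next_chunk = 0;   // shared queue pointer in global memory

__global__ void workQueue(float* data, int n)
{
    __shared__ int chunk;
    while (true)
    {
        if (threadIdx.x == 0)
            chunk = atomicAdd(&next_chunk, 1);   // claim the next chunk
        __syncthreads();
        if (chunk * blockDim.x >= n)
            return;                              // queue drained
        int i = chunk * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;                     // process claimed chunk
        __syncthreads();                         // before thread 0 re-claims
    }
}

Because the queue pointer only moves forward, each chunk is processed exactly once under any interleaving, concurrent or sequential. (next_chunk must be reset, e.g. with cudaMemcpyToSymbol, before each launch.)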
The Graphics Pipeline

[Figure: fixed-function pipeline — Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

Number of modes for each stage grew over time
Hard to optimize HW
Vertex & pixel processing became programmable, new stages added
Expanded to full ISA
Why GPUs scale so nicely

Workload and programming model provide lots of parallelism
Applications provide large groups of vertices at once
Vertices can be processed in parallel
Apply same transform to all vertices
Triangles contain many pixels
Pixels from a triangle can be processed in parallel
Apply same shader to all pixels
Very efficient hardware to hide serialization bottlenecks
With Moore’s Law…

[Figure: pipeline stages replicated — multiple Vertex, Raster, and Blend units — evolving into a unified pool of programmable processors that handles vertex work (Vrtx 0–2) and pixel work (Pixel 0–3) on the same hardware, for more efficiency and control]