
CS195V Week 11

CUDA Part 2

Shared Memory and Thread Communication


Recall shared memory from last lecture
Fast, equivalent to L1 cache (~64 KB)
Shared between threads in a block

Best way to communicate between threads


You can synchronize across threads in a block to ensure r/w is complete

Communicating through global memory requires multiple kernel invocations (since you don't know when a read/write op is done)

Reductions
A commonly used operation is the reduction, where some function (e.g. sum) is used to combine the values of an array into a single result
The bad but easy way:
extern __shared__ float cache[];
cache[threadIdx.x] = threadIdx.x;
__syncthreads(); // ensure all threads are done writing to cache
if(threadIdx.x == 0) {
    for(int i = 1; i < blockDim.x; i++) {
        cache[0] += cache[i];
    }
    printf("%f\n", cache[0]);
}

Reductions
The better way:
extern __shared__ float cache[];
cache[threadIdx.x] = threadIdx.x;
__syncthreads();
// assumes blockDim.x is a power of two: halve the active range each step
for(int i = blockDim.x; i > 0; i >>= 1) {
    int halfPoint = (i >> 1);
    if(threadIdx.x < halfPoint)
        cache[threadIdx.x] += cache[threadIdx.x + halfPoint];
    __syncthreads();
}
if(threadIdx.x == 0)
    printf("%f\n", cache[0]);
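Because cache is declared extern __shared__, its size must be supplied as the third kernel launch parameter. A minimal host-side sketch, assuming the fragment above is wrapped in a kernel called reduce_kernel (the name and block size are illustrative, not from the slides):

// Hypothetical launch of a kernel containing the reduction above.
// The third <<<...>>> argument sets the dynamic shared memory size in bytes.
int threadsPerBlock = 256;                           // assumed power of two
size_t smemBytes = threadsPerBlock * sizeof(float);
reduce_kernel<<<1, threadsPerBlock, smemBytes>>>();
cudaDeviceSynchronize();                             // wait for the kernel (and its printf) to finish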

CUDA Random Number Generation (RNG)


Historically it has been very difficult to create random numbers on the GPU
Had to sample from random textures / implement your own PRNG (e.g. LCG, Mersenne Twister, etc.)
Implementing a parallel PRNG isn't trivial (one option is to give each PRNG thread a different seed, but you lose some randomness guarantees)!
And you wonder why GLSL noise() doesn't work...

Compute 2.0 added built-in functionality for RNG
The random headers can be found in <curand.h> and <curand_kernel.h>

CUDA Random Number Generation (RNG)


We will give a randState to each CUDA thread, from which it can sample
On the host, create a device pointer to hold the randStates
Malloc a number of states equal to the number of threads
Pass the device pointer to your kernel
Init the random states
Call a random function (e.g. curand_uniform) with the state given to that thread
Free the randStates

CUDA Random Number Generation (RNG)


__global__ void init_stuff(curandState *state) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(1337, idx, 0, &state[idx]);
}

__global__ void make_rand(curandState *state, float *randArray) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    randArray[idx] = curand_uniform(&state[idx]);
}

void host_function() {
    curandState *d_state;
    // one state per thread; note the sizeof(curandState) in the allocation
    cudaMalloc(&d_state, nBlocks * nThreads * sizeof(curandState));
    init_stuff<<<nBlocks, nThreads>>>(d_state);
    make_rand<<<nBlocks, nThreads>>>(d_state, randArray);
    cudaFree(d_state);
}

CUDA Random Number Generation (RNG)


The total state space of the PRNG before you start to see repeats is about 2^190
CUDA's RNG is designed such that, given the same seed in each thread, each thread generates random numbers spaced 2^67 numbers apart in the PRNG's sequence
When calling curand_init with a seed, it scrambles that seed and then skips ahead 2^67 numbers (this is kind of expensive, but has some nice properties)
This even spacing between threads means you can analyze the randomness of the PRNG once, and those results hold no matter what seed you use

CUDA Random Number Generation (RNG)


What if you're running millions of threads and each thread needs RNs?
This is not uncommon
You could run out of state space per thread and start seeing repeats...
((2^190) / 10^6) / 2^67 ≈ 1.06 × 10^31

Can seed each thread with a different seed (e.g. threadIdx.x), and then set the sequence number to zero (i.e. don't advance each thread by 2^67)
This may introduce some bias / correlation, but there aren't many other options
You don't have the same assurance that the statistical properties stay the same as the seed changes
It's also faster (by a factor of 10x or so)
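A minimal sketch of this faster seeding scheme, reusing the structure of the earlier init kernel (the kernel name init_fast is illustrative, not from the slides):

// Illustrative variant of the earlier init kernel: the thread index becomes the
// seed, and the sequence number is 0, so no expensive 2^67 skip-ahead occurs.
__global__ void init_fast(curandState *state) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(idx,  /* seed: different per thread     */
                0,    /* sequence number: no skip-ahead */
                0,    /* offset                         */
                &state[idx]);
}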

CUDA Random Number Generation (RNG)


Why do we lose some statistical guarantees of randomness?
Suppose we choose seeds equal to threadIdx.x (i.e. 0, 1, 2, ...)
Now suppose the seed scrambler (essentially a hash function) has a collision between threads 0 and 4
This means threads 0 and 4 will be generating the same sequence of numbers

There could also be bias introduced by the choice of hash function itself...

CUDA Random Number Generation (RNG)


The take home message:
Depending on your problem, you may need to be careful when using CUDA's RNG (e.g. crypto)
If you're making pretty pictures (i.e. graphics), it probably doesn't matter

CUDA Libraries

CUDA has a lot of libraries that you can use to make things much easier

Lots of unofficial libraries, but we'll cover some of the main included libraries here

Not used by our projects, but if you pursue CUDA in the future, you will most certainly make use of these

CUFFT

Based on the successful FFTW library
Adds FFT (Fast Fourier Transform) functionality to CUDA, as you might expect
1D, 2D, 3D, complex and real data
1D transform size of up to 128 million elements
Order of magnitude speedup over multi-core CPU implementations

CUFFT Example (3D)


#include <cufft.h>

#define NX 64
#define NY 64
#define NZ 128

cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);

/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);

/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);

/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);

/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1);
cudaFree(data2);

// note that the cuFFT calls take the place of your CUDA kernel

CUBLAS

CUDA Basic Linear Algebra Subroutines
Lets you do linear algebra-y things easily

152 standard BLAS (Basic Linear Algebra Subprogram) operations

Easy way to do matrix multiplication, etc.
Execution is very similar to CUFFT

You get a handle and call built-in functions on it with your data
Kernel launches are replaced by library functions
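As a concrete illustration of the handle pattern, here is a minimal sketch using cublasSaxpy (y = alpha*x + y); the function wrapper, array names, and sizes are placeholders, and d_x / d_y are assumed to already hold data on the device:

#include <cublas_v2.h>

// Minimal sketch of the CUBLAS handle pattern (names and sizes are illustrative).
// d_x and d_y are assumed to be device arrays of length n that already hold data.
void saxpy_example(float *d_x, float *d_y, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);                    // get a handle

    const float alpha = 2.0f;
    // y = alpha * x + y, computed on the GPU; no hand-written kernel needed
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    cublasDestroy(handle);
}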

Other CUDA Libraries

CURAND
Generate large batches of random numbers

CUDA math library
Normal math functionality, exactly like C/C++ (same #include <math.h>)

CUSPARSE
Sparse matrix manipulation

NVIDIA Performance Primitives (NPP)
Image and signal processing primitives

Thrust
Parallel algorithms and data structures
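As a minimal sketch of how Thrust replaces a hand-written reduction kernel (the vector size and fill value here are arbitrary placeholders):

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// Minimal Thrust sketch: sum a device vector without writing a kernel.
// The size (1 << 20) and fill value (1.0f) are arbitrary for illustration.
int main() {
    thrust::device_vector<float> d(1 << 20, 1.0f);  // allocate & fill on the GPU
    float sum = thrust::reduce(d.begin(), d.end(), 0.0f, thrust::plus<float>());
    printf("%f\n", sum);                            // prints 1048576.000000
    return 0;
}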

Ant Colony Optimization


http://www.csse.monash.edu.au/~berndm/CSE460/Lectures/cse460-9.pdf

Parallelizing ACO on the GPU


We can have thousands or even millions of ants
An easy way of splitting up the work is to give each worker ant its own thread

Recall that each ant must be able to add its tour to the pheromone graph before proceeding to the next time step
Need to synchronize across ants (i.e. after each ant has constructed its tour, it must add pheromone to the graph before we can compute the next time step's traversal probabilities)

Parallelizing ACO on the GPU


We don't have any easy way to synchronize across all threads on the GPU
In fact, the only way is to call multiple kernel invocations with a cudaThreadSynchronize() in between
for(int i = 0; i < numIterations; i++) {
    tsp<<<1, numAnts>>>();
    cudaThreadSynchronize();
}

This is expensive (we're not reaching full occupancy)

Recall that we do have built in functionality to synchronize threads in a block without exiting the kernel...

Multi Colony Optimization on the GPU


An easy way of parallelizing ACO is to create multiple colonies of ants
Each colony of ants stores its own pheromone graph and runs multiple iterations of tour construction
Every so often, communicate between the different colonies (e.g. pass best paths / share pheromone information)

Intuitive subdivision!
Number of blocks = Number of colonies
Number of ants per colony = Number of threads per block
Total number of threads = Number of ants per colony × Number of colonies
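In launch-configuration terms, this mapping might look like the following sketch (the kernel and variable names are illustrative, not from the slides):

// Illustrative launch for the colony-per-block mapping described above.
// aco_iteration, numColonies, and antsPerColony are hypothetical names.
dim3 blocks(numColonies);        // one block per colony
dim3 threads(antsPerColony);     // one thread per ant within a colony
aco_iteration<<<blocks, threads>>>(d_pheromones, d_bestTours);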

Multi Colony Optimization on the GPU


Now, when all ants have created their tours, we can sum up their total contribution for each edge across all threads in a block by using __syncthreads() and a reduction
Similarly, we can update the pheromone evaporation by having each thread in a block update some set of edges, and then __syncthreads() again before the next tour construction step
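One possible in-kernel structure for a single colony's time step, as a rough sketch: the kernel name, the edge-indexed tour layout, the atomicAdd-based deposit, and the evaporation rate are all illustrative assumptions, not the slides' exact method.

// Hypothetical per-block (per-colony) iteration structure; all names and the
// pheromone deposit rule are illustrative placeholders.
__global__ void colony_iteration(float *pheromone, const int *tourEdges,
                                 int tourLen, int numEdges) {
    int ant = threadIdx.x;                        // one thread per ant

    // 1. (Tour construction for this ant would happen here.)

    __syncthreads();                              // wait until every ant has a tour

    // 2. Deposit: each ant adds pheromone to the edges of its own tour.
    //    atomicAdd avoids races when two ants share an edge (a block-level
    //    reduction, as described above, is the alternative).
    for (int k = 0; k < tourLen; k++)
        atomicAdd(&pheromone[tourEdges[ant * tourLen + k]], 1.0f);

    __syncthreads();                              // all deposits visible

    // 3. Evaporation: each thread updates a disjoint subset of edges.
    for (int e = ant; e < numEdges; e += blockDim.x)
        pheromone[e] *= 0.9f;                     // illustrative evaporation rate

    __syncthreads();                              // ready for the next tour step
}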
