Week 11
CUDA Part 2
Synchronizing through global memory requires multiple kernel invocations, since within a single kernel you don't know when another block's read/write operation has completed
Reductions
A commonly used operation is the reduction, where some function (e.g. sum) is used to combine the values of an array into a single value
The bad but easy way:
extern __shared__ float cache[];
cache[threadIdx.x] = threadIdx.x;
__syncthreads(); // ensure all threads are done writing to cache
if (threadIdx.x == 0) {
    // thread 0 serially adds every other element into cache[0]
    for (int i = 1; i < blockDim.x; i++) {
        cache[0] += cache[i];
    }
    printf("%f\n", cache[0]);
}
The better way:
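extern __shared__ float cache[];
cache[threadIdx.x] = threadIdx.x;
__syncthreads();
// halve the active range on each pass; assumes blockDim.x is a power of two
for (int i = blockDim.x; i > 0; i >>= 1) {
    int halfPoint = (i >> 1);
    if (threadIdx.x < halfPoint)
        cache[threadIdx.x] += cache[threadIdx.x + halfPoint];
    __syncthreads(); // all threads must reach this, including idle ones
}
if (threadIdx.x == 0)
    printf("%f\n", cache[0]);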
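The tree reduction above can be wrapped into a complete kernel plus launch code. The following is a minimal sketch, not the course's reference code: the names `sumThreadIndices` and `blockSum` are hypothetical, a single block is launched, and `blockDim.x` is assumed to be a power of two.

```cuda
#include <cuda_runtime.h>

// Each thread stores its own index in shared memory, then the block sums
// the values with a tree reduction. Assumes blockDim.x is a power of two.
__global__ void sumThreadIndices(float *out)
{
    extern __shared__ float cache[];
    cache[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();

    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();  // reached by every thread, including idle ones
    }

    if (threadIdx.x == 0)
        *out = cache[0];  // only one thread writes the result
}

// Hypothetical host helper: launches one block and returns its sum.
float blockSum(int threads)
{
    float *d_out = nullptr;
    float h_out = 0.0f;
    cudaMalloc(&d_out, sizeof(float));
    // Third launch parameter sizes the extern __shared__ array.
    sumThreadIndices<<<1, threads, threads * sizeof(float)>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Calling `blockSum(256)` sums the indices 0 through 255. The serial version needs about N additions on one thread, while this version needs about log2(N) passes with the additions in each pass done in parallel.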
Compute 2.0 added built-in functionality for RNG
The random number headers are <curand.h> (host API) and <curand_kernel.h> (device API)
Can seed each thread with a different seed (e.g. threadIdx.x), and then set the sequence number to zero (i.e. don't skip each thread ahead by 2^67)
This may introduce some bias / correlation, but there are not many other options
You don't have the same assurance that the statistical properties remain the same as the seed changes
It's also faster (by a factor of 10x or so)
There could also be bias introduced by the choice of hash function itself...
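The two seeding strategies can be compared side by side with the cuRAND device API. This is a minimal sketch under assumptions: `fillUniform`, `samplesInRange`, and the seed value 1234 are hypothetical, and the kernel only draws one sample per thread.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <vector>

// Draws one uniform float per thread.
// seedPerThread == true : different seed per thread, sequence 0 (cheap setup).
// seedPerThread == false: same seed, sequence = thread id (skips ahead).
__global__ void fillUniform(float *out, int n, unsigned long long seed,
                            bool seedPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    curandState state;
    if (seedPerThread)
        curand_init(seed + tid, 0, 0, &state);
    else
        curand_init(seed, tid, 0, &state);

    out[tid] = curand_uniform(&state);  // value in (0, 1]
}

// Hypothetical host helper: true if every sample lies in (0, 1].
bool samplesInRange(int n, bool seedPerThread)
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));
    fillUniform<<<(n + 127) / 128, 128>>>(d_out, n, 1234ULL, seedPerThread);

    std::vector<float> h(n);
    cudaMemcpy(h.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    for (float v : h)
        if (!(v > 0.0f && v <= 1.0f)) return false;
    return true;
}
```

Both branches produce valid samples. The difference is the trade-off described above: per-thread seeds make setup cheaper but give weaker guarantees about correlation between threads.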
CUDA Libraries
CUDA has a lot of libraries that you can use to make things much easier
Lots of unofficial libraries, but we'll cover some of the main included libraries here
Not used by our projects, but if you pursue CUDA in the future, you will most certainly make use of these
CUFFT
Modeled on the successful FFTW library for C
Adds FFT (Fast Fourier Transform) functionality to CUDA, as you might expect
1D, 2D and 3D transforms of complex and real data
1D transform size of up to 128 million elements
Order of magnitude speedup over multi-core CPU implementations
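As a rough illustration of the plan-then-execute style that CUFFT shares with FFTW, the sketch below runs a forward 1D complex-to-complex transform on an all-ones signal. The helper name `fftOfOnesDC` is hypothetical and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>
#include <vector>

// Forward 1D complex-to-complex FFT of an all-ones signal of length n.
// Returns the real part of bin 0. CUFFT transforms are unnormalized,
// so for this input bin 0 should be close to n.
float fftOfOnesDC(int n)
{
    std::vector<cufftComplex> h(n);
    for (auto &c : h) { c.x = 1.0f; c.y = 0.0f; }

    cufftComplex *d = nullptr;
    cudaMalloc(&d, n * sizeof(cufftComplex));
    cudaMemcpy(d, h.data(), n * sizeof(cufftComplex), cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);      // one transform of size n
    cufftExecC2C(plan, d, d, CUFFT_FORWARD);  // in-place execution

    cudaMemcpy(h.data(), d, n * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(d);
    return h[0].x;
}
```

Build with `nvcc` and link against the library with `-lcufft`. Note that no kernel is written by hand: the plan decides how the transform is launched.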
CUBLAS
CUDA Basic Linear Algebra Subprograms
Lets you do linear algebra-y things easily
You get a handle and call built-in functions on it with your data
Kernel launches are replaced by library function calls
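The handle-based workflow can be sketched with SAXPY (y = alpha * x + y), one of the simplest BLAS routines. The wrapper name `saxpy` is hypothetical and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <vector>

// Computes y = alpha * x + y on the GPU and returns the result.
// x and y are assumed to have the same length.
std::vector<float> saxpy(float alpha, const std::vector<float> &x,
                         std::vector<float> y)
{
    int n = (int)x.size();
    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);                           // get a handle
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // library call, no kernel launch
    cublasDestroy(handle);

    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

Build with `nvcc` and link with `-lcublas`. The pointers passed to `cublasSaxpy` are device pointers, while `alpha` is read from host memory by default.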
Other Libraries
CURAND: generate large batches of random numbers
math.h: normal math functionality, exactly like C/C++ (same #include <math.h>)
CUSPARSE: sparse matrix manipulation
NPP: image and signal processing primitives
Thrust: parallel algorithms and data structures
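To give a flavour of Thrust's "parallel algorithms and data structures", here is a small sketch that repeats the sum reduction from earlier without writing any kernel. The function name `sumFirstN` is hypothetical.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>

// Fills a device vector with 0, 1, ..., n-1 and sums it on the GPU.
int sumFirstN(int n)
{
    thrust::device_vector<int> v(n);       // storage lives in device memory
    thrust::sequence(v.begin(), v.end());  // 0, 1, ..., n-1
    return thrust::reduce(v.begin(), v.end(), 0, thrust::plus<int>());
}
```

The interface deliberately resembles the C++ standard library: `device_vector` plays the role of `std::vector`, and `reduce` plays the role of `std::accumulate`.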
Recall that each ant must be able to add its tour to the pheromone graph before proceeding to the next time step
Need to synchronize across ants (i.e. after each ant has constructed its tour, it must add pheromone to the graph before we can compute the next time step's traversal probabilities)
Recall that we do have built in functionality to synchronize threads in a block without exiting the kernel...
Intuitive subdivision!
Number of blocks = Number of colonies
Number of ants per colony = Number of threads per block
Total number of threads = Number of ants x Number of colonies
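The mapping above might look like the following skeleton. It is only a sketch under heavy assumptions: `runColonies` and `totalPheromone` are hypothetical names, tour construction is replaced by a placeholder that picks one edge per step, and each colony keeps its own pheromone array of `numEdges` floats.

```cuda
#include <cuda_runtime.h>
#include <vector>

// blockIdx.x = colony, threadIdx.x = ant within that colony.
__global__ void runColonies(float *pheromone, int numEdges, int steps)
{
    float *colony = pheromone + blockIdx.x * numEdges;
    for (int t = 0; t < steps; ++t) {
        // Placeholder for tour construction: a real ant would read the
        // colony's pheromone here to choose its edges.
        int edge = (threadIdx.x + t) % numEdges;
        __syncthreads();                 // all ants finished reading
        atomicAdd(&colony[edge], 1.0f);  // deposit pheromone
        __syncthreads();                 // deposits visible before next step
    }
}

// Hypothetical host helper: runs the kernel and returns the total
// pheromone deposited across all colonies.
float totalPheromone(int colonies, int ants, int numEdges, int steps)
{
    size_t count = (size_t)colonies * numEdges;
    float *d = nullptr;
    cudaMalloc(&d, count * sizeof(float));
    cudaMemset(d, 0, count * sizeof(float));

    runColonies<<<colonies, ants>>>(d, numEdges, steps);  // blocks, threads

    std::vector<float> h(count);
    cudaMemcpy(h.data(), d, count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    float total = 0.0f;
    for (float v : h) total += v;
    return total;
}
```

Because ants in one colony share a block, `__syncthreads()` is enough to separate the time steps without leaving the kernel. Colonies never need to wait on each other, so no cross-block synchronization is required.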