Cs344 - Lesson 2 - GPU Hardware and Parallel Communication Patterns - Udacity
Welcome to Unit 2
Okay. Welcome to unit 2. It's good to see you again. So in the last unit you learned about the fundamentals of the GPU programming model and the basics of writing a simple program using CUDA. In this unit we're going to build off of that. We'll learn about important parallel communication patterns like scatter, and gather, and stencil. And we'll dive a little deeper into the GPU hardware and learn about things like global memory and shared memory, and we'll put these together to learn how to write efficient GPU programs.
Communication Patterns
Let's recap what we've learned so far. Parallel computing is all about many threads solving a problem by working together. And the key is this working together: any book on business practices or teamwork will tell you that working together is really all about communication. In CUDA, this communication takes place through memory. For example, threads may need to read from the same input location. Threads may need to write to the same output location. Sometimes threads may need to exchange partial results.
Scatter Quiz
Rather than having each thread read three neighboring elements, average their values, and write a single output result, we can have each thread read a single input element and add one third of its value to the three neighboring output elements. So each of these writes would really be an increment operation. You can imagine the same thing in our 2D image blurring example, where each thread takes one input element, or pixel, and writes a fraction of its value to the neighboring pixels. So, when each parallel task needs to write its result in a different place, or in multiple places, we call this scatter, because the threads are scattering their results over memory. You can already see a problem that we're going to have with scatter: you've got several threads attempting to write to the same place at more or less the same time. This is something we'll have to talk about later. Let's have a quick quiz on this. Suppose you have a list of basketball players. So, you've got a bunch of records, and each one has the name of the player, the height of the player, and the player's rank in height. Okay? So, in the league, or on the team, whether this is the first tallest, the second tallest, the third tallest, and so on. So, you've got the rank and height. And say that your goal now is to write each player's record into its location in a sorted list. So, if we implement this in CUDA by having each thread read a record, look at the rank, and use that rank to determine where to write into the array, is this a map operation, a gather operation, or a scatter operation?
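To make the 1D blurring version of scatter concrete, here is a minimal sketch. The kernel name and the use of atomicAdd are my additions, not the course's code; the course only notes that colliding writes are a problem to be solved later.

// A sketch of the scatter pattern described above: each thread reads one
// input element and adds one third of its value to the element and its two
// neighbors in the output (out is assumed to be zero-initialized).
// Float atomicAdd requires compute capability 2.0 or later.
__global__ void scatter_blur(const float *in, float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;

    float contribution = in[i] / 3.0f;

    // Each thread writes to several output locations: a scatter.
    // Neighboring threads scatter to overlapping locations, so the writes
    // are done atomically to avoid collisions.
    if (i > 0)     atomicAdd(&out[i - 1], contribution);
                   atomicAdd(&out[i],     contribution);
    if (i < n - 1) atomicAdd(&out[i + 1], contribution);
}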
Stencil - Quiz
So the image blurring example that we've been using actually illustrates another important kind of communication pattern called stencil. Stencil codes update each element in an array using neighboring array elements in a fixed pattern called the stencil. This is the stencil pattern we saw before; it's technically known as a 2D von Neumann stencil. Let me reiterate how this worked, and this time use color coding a little differently to show you what's going on. So here, I've color coded the threads to show you which one is going to be working on which output element. So, I'll choose the blue one to be writing into this value. Here's where the red one's going to write its output value. Here's where the green thread will write its output value. And if you look at what's going to happen, each of these threads is going to read from several locations in memory surrounding the corresponding input value, and those locations are defined by the stencil. So, the blue thread will do a gather from these locations, then the red thread will do a gather from the overlapping neighborhood, and then the green thread will do a gather from this neighborhood, and so on. Eventually, there'll be some other thread that's responsible for, say, writing to this value, and that thread is going to go and access these values. So,
something you're going to notice right away is that there is a lot of data reuse going on. Many threads are accessing and computing from the same data. And exploiting that data reuse is something we're going to use later on when you're working on your homework assignment; we're going to try to exploit that reuse to speed up your homework assignment. Now, there are other common stencil patterns. For example, you might read from all of the neighboring elements, including the diagonal elements, and that would be called a 2D Moore pattern. And there are also 3D analogs of these, so for my next trick, I'm going to attempt to draw in three dimensions. Hopefully, you can see that from my drawing. So, speaking of data reuse, here's a quick quiz. Can you figure out how many times a given input element in the array will be read when applying each of these stencils?
Stencil - Solution
The answer, of course, is simply the number of elements in the stencil. Right, so every element in the array is going to be read five times by the 2D von Neumann stencil, because there are five entries in the neighborhood. So, all elements will be read five times by the 2D von Neumann stencil, nine times by the 2D Moore stencil, and seven times by the 3D von Neumann stencil.
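To see where those five reads per element come from, here is a minimal sketch of a 2D von Neumann stencil kernel. The kernel name and the averaging operation are illustrative assumptions, and boundary pixels are simply skipped rather than handled properly.

// Each thread gathers its own element plus the four neighbors above,
// below, left, and right, and writes a single averaged output element.
__global__ void stencil_von_neumann(const float *in, float *out,
                                    int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    // Skip the border so every read stays inside the array.
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;

    int idx = y * width + x;
    float sum = in[idx]
              + in[idx - 1]     + in[idx + 1]       // left, right
              + in[idx - width] + in[idx + width];  // up, down

    out[idx] = sum / 5.0f;  // five entries in the von Neumann neighborhood
}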
Transpose Part 1
Another parallel communication pattern worth mentioning is called transpose. For example, you might have a 2D array, such as an image, laid out in row-major order. This means that the elements of the array, or the pixels of the image, are laid out one row at a time. And I've color-coded the rows here just to show you more clearly what I'm doing. But you might want to do some processing on the columns of the same image, and so you'd want to lay it out like this. This means you need to do an operation to reorder the elements. As you can see, I've drawn this as a scatter operation: each thread is reading from an adjacent element in the array but is writing to someplace scattered in memory, according to the stride of this row-to-column transpose. I could also have expressed this as a gather operation, like so.
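Here is a minimal sketch of the transpose expressed as a gather, under the assumption of a row-major rows-by-cols input; the kernel name and launch configuration are mine, not the course's.

// Thread (x, y) writes exactly one output element (one-to-one writes) and
// gathers it from the corresponding transposed location in the input.
__global__ void transpose_gather(const float *in, float *out,
                                 int rows, int cols)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;  // column of the output
    int y = threadIdx.y + blockIdx.y * blockDim.y;  // row of the output
    if (x >= rows || y >= cols) return;

    // The output has cols rows and rows columns, still row-major.
    out[y * rows + x] = in[x * cols + y];
}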
Transpose Part 2
So you can see where a transpose might come up when you're doing array operations, matrix operations, or image operations. But the concept is generally applicable to all kinds of data structures. Let me give an example. So, here's some sort of structure you might have, right? It's a perfectly reasonable structure, foo. It's got a float field and an integer field, and say that you have a thousand of these. Well, what does that look like in memory? You're going to have the floats and the integers interleaved throughout memory. And as we will talk about later, if you're going to do a lot of processing on the floats, it can be more efficient to access all of the floats contiguously. So you're going to want some operation that lets you take what's called an array-of-structures representation and turn it into a structure of arrays. And that operation is, again, a transpose. By the way, these two terms are so common that array of structures is often abbreviated AoS, and structure of arrays is often abbreviated SoA. You'll see these terms come up frequently in parallel computing. So, to summarize, the transpose operation is where tasks reorder data elements in memory.
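As a small sketch of that AoS-to-SoA transpose, here is one way it might look; the struct layout and kernel are illustrative assumptions, not code from the course.

// Array-of-structures element: a float and an int interleaved in memory.
struct foo {
    float f;
    int   i;
};

// Each thread reads one struct and writes its fields into two separate,
// contiguous arrays, so later kernels can access all the floats together.
__global__ void aos_to_soa(const foo *in, float *floats, int *ints, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;

    floats[idx] = in[idx].f;
    ints[idx]   = in[idx].i;
}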
This next code, as I said, I put a guard around: only the odd-numbered threads are going to execute it. So that rules out a map; it's not one-to-one. And that also rules out a transpose operation, which is also one-to-one. And you really couldn't call it a stencil operation either, because a stencil operation should generate a result for every element in the output array, and this doesn't do that. Now, if you look at the first one, the thread is taking the input at a given location, multiplying it by pi, and placing that into a couple of different places in the output array. In fact, it's incrementing a couple of different places in the output array. So this would be a scatter operation: the thread is computing for itself where it needs to write its result. And this final line would be a gather. You can see that every thread is writing to a single location in the output array and reading from multiple places in the input array, locations that it computes. So this would be a gather. And again, this looks very much like a stencil operation, since it's reading from a local neighborhood, doing some averaging, and writing the result, but I wouldn't call it a stencil because it's not writing into every location, because of this guard here. So that's why I refer to this as a gather rather than a stencil.
multiprocessors, or SMs for short. Now, different GPUs have a different number of SMs: a small GPU might only have one SM, whereas a really big GPU might have 16 SMs, for example. And an SM, in turn, has many simple processors that can run a bunch of parallel threads. It also has some other things, like some memory, that we'll talk more about in a moment. So, when you've got a CUDA program with a bunch of threads organized into thread blocks, the important thing to understand is that the GPU is responsible for allocating blocks to SMs. Let me say that again because it's really important: the GPU is responsible for allocating the blocks to the SMs. So, as a programmer, all you have to worry about is giving the GPU a big pile of thread blocks, and the GPU will take care of assigning them to run on the hardware SMs. All the SMs run in parallel and independently.
threads and blocks must complete, okay? You can't simply have a thread that hangs around forever, processing input or doing something, because that thread must complete in order for the block that it's in to complete, so that other threads and blocks can get scheduled onto that SM.
own copy of whatever local variables, that is, variables private to that thread, sitting in local memory. So this is true. And threads from the same block each have their own copy of local variables in local memory. Right. So, just because they're from the same block doesn't mean that they share local variables. They share shared memory, per-block shared memory, and they all share global memory. So all four of these are true.
Synchronization - Barrier
So now we know that threads can access each other's results when they're shared in global memory. This means they can work together on a computation, but there's a problem. What if a thread tries to read a result before another thread has had a chance to compute or write it? This means we need synchronization. Threads need to synchronize with each other to avoid this situation. This need for synchronization is really one of the most fundamental problems in parallel computing. Now, the simplest form of synchronization is called a barrier. A barrier is a point in the program where all the threads stop and wait. When all the threads have reached the barrier, they can proceed on to the rest of the code. Let's illustrate this. Here are some threads, and they're all proceeding along through the code. I'll draw them in different colors, and I'm also drawing them different lengths, but you get the idea that they're at different places in the code, at different points in their execution of the program. The idea is that when they reach the barrier, they're going to stop and wait for all the others to catch up. So in my drawing, the red one reaches the barrier first and stops. In the meantime, the blue one is proceeding along and the green one is proceeding along, and eventually the blue one arrives at the barrier and stops, and the green one is the last one to arrive at the barrier and stops. And now that all three threads, in my example I only have three threads, have arrived at the barrier, they're all free to go again. And so they'll all proceed again, and we don't actually know which one's going to go first. It might be that the blue one is the first out of the gate, maybe green is next, maybe red is last. So let's look at some code to illustrate this.
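The code shown in the video isn't reproduced here, so the following is a minimal sketch of a barrier in practice, with a hypothetical kernel name and an assumed block of 128 threads: every thread writes one element of a shared array, and all threads must reach the barrier before any thread reads an element written by a neighbor.

__global__ void shift_left(int *out)
{
    int i = threadIdx.x;
    __shared__ int s[128];   // assumes a single block of 128 threads

    s[i] = i;                // each thread fills one element
    __syncthreads();         // barrier: all writes are complete past this point

    if (i < 127)
        out[i] = s[i + 1];   // safe: the neighbor's write has finished
}

Without the __syncthreads call, a thread could read s[i + 1] before its neighbor had written it, which is exactly the read-before-write hazard described above.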
Programming Model
So let's recreate our programming model diagram. We've got threads, and we've got thread blocks. And really, what is happening with these barriers is that the syncthreads call is creating a barrier within a thread block. So all the threads within this thread block are going to get to this syncthreads call, stop, and wait until they've all arrived, and then proceed. And these thread blocks are organized into kernels; every kernel has a bunch of thread blocks. We haven't really talked about this, but there's an implicit barrier between kernels. So if I launch two kernels, one after another, then by default kernel A will finish before kernel B is allowed to start. So all of these threads will complete before any of these threads get launched. So when you add in the memory model that we've talked about, where every thread has access to its own local memory, to its block's shared memory, and to the global memory shared by all threads and all kernels in the system, then what you've got is CUDA. At its heart, CUDA is a hierarchy of computation, that is, threads, thread blocks, and kernels, with a corresponding hierarchy of memory spaces, local, shared, and global, and synchronization primitives, syncthreads barriers and the implicit barrier between two synchronous kernels.
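As a small sketch of that implicit barrier between kernels, consider the following; the kernel names, launch sizes, and body are placeholders of my own, but the ordering behavior is the point.

// On the default stream, every thread of kernel_a completes before any
// thread of kernel_b is launched: an implicit barrier between kernels.
__global__ void kernel_a(float *d) { d[threadIdx.x] += 1.0f; }
__global__ void kernel_b(float *d) { d[threadIdx.x] *= 2.0f; }

void run(float *d_array)   // d_array: device memory with at least 128 floats
{
    kernel_a<<<1, 128>>>(d_array);   // all 128 threads of kernel_a finish...
    kernel_b<<<1, 128>>>(d_array);   // ...before kernel_b's threads start
}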
Global Memory
So, once again we've got a kernel, and we know it's a kernel because it's been tagged with global, so it's going to run on the GPU but can be called from the host. And once again we're going to pass in a local variable, a parameter called array. And the trick is that this parameter is a pointer, so it's actually pointing to some global memory that we've allocated elsewhere. I'll show you how to do that in a moment. Once you've got a pointer to global memory, you can manipulate the contents of that memory just as you would manipulate any other chunk of memory. So in this case, I'm going to take my array and set one of its elements, the one whose index happens to equal this thread's index, to some number, which happens to be 2.0 times this thread's index. Again, not a very useful function, but it illustrates what's happening. So the point really is that since all the parameters to a function are local variables, private to that thread, if you want to manipulate global memory you're going to have to pass in a pointer to that memory. And of course that means you're going to have to allocate that memory, so let's look at how that works. Here's the code to show off how we use global memory. The first thing I'm going to do is declare some host memory. And, once again, I'm using the convention that starting a variable with the prefix h underscore indicates that this is memory that lives on the host. So here's an array of 128 floats. And I'm also going to declare a pointer that I'm going to use to point to the global memory that I allocate on the device. And once again, the d underscore convention indicates that this variable is on the device. Now I want to allocate some global memory on the device, so I'm going to use the function cudaMalloc. What's happening here is that I'm passing it a pointer to this variable, which is itself a pointer, right? And cudaMalloc is going to allocate some memory, in this case room for 128 floats, and stuff a pointer to that memory into the pointer d_array. If you're allocating
memory, you'll probably want to initialize it to something, so we use cudaMemcpy for that operation. And in this case we pass in a pointer to the destination memory, which is this d_array that we've allocated, and a pointer to the source memory, which is this h_array variable, and then the number of bytes to copy, and then we indicate whether we're copying from the host to the device, or vice versa. Oops, this is a bug. So now we've got a chunk of global memory, we've put something in it, and now we're ready to launch the kernel that's going to operate on that global memory. So here's the kernel that we saw earlier. Again, we're going to launch a single thread block consisting of 128 threads, and I'm going to pass in this pointer to the memory I've allocated and initialized. So after this runs, presumably this code will have done something to that memory that I passed in, and now I'll need to copy it back to the host. If I want to use the results of this kernel back on the host, then I need to copy the memory back into host memory. And so, here's that operation. Once again, cudaMemcpy. This time, the destination is h_array, the source is d_array, the same number of bytes, and now we're copying from device to host. Okay, so that's how you use global memory. The trick is that, since you can only pass local variables into a kernel, you have to allocate and initialize global memory outside the kernel and then pass in a pointer. Finally, let's look at how you would use shared memory.
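Before we do, here is a minimal sketch pulling together the global-memory flow just described. It is reconstructed from the narration rather than copied from the course listing, so the exact names and the initial contents of h_array are assumptions.

__global__ void use_global_memory_GPU(float *array)
{
    // "array" points to global memory allocated and initialized by the host.
    array[threadIdx.x] = 2.0f * (float)threadIdx.x;
}

int main(void)
{
    float h_array[128];          // h_ prefix: memory that lives on the host
    float *d_array;              // d_ prefix: will point to device memory

    for (int i = 0; i < 128; i++)
        h_array[i] = (float)i;   // something to copy down to the device

    cudaMalloc((void **)&d_array, sizeof(float) * 128);   // allocate global memory
    cudaMemcpy(d_array, h_array, sizeof(float) * 128,
               cudaMemcpyHostToDevice);                   // initialize it

    use_global_memory_GPU<<<1, 128>>>(d_array);           // one block, 128 threads

    cudaMemcpy(h_array, d_array, sizeof(float) * 128,
               cudaMemcpyDeviceToHost);                   // copy results back
    cudaFree(d_array);
    return 0;
}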
Shared Memory
This example's a little more complicated. For clarity, remember that I'm just hardcoding the idea that there are 128 threads, and therefore we're going to operate on 128 elements of the array, all right? And I'm going to skip all of the out-of-bounds checks and assertions that you would normally use to make sure that you're not trying to access a piece of memory that's not there. So once again, we have a function, use shared memory GPU. This function is a kernel, and we're going to pass in a local variable, which is a pointer to a bunch of floats. And this local variable is a pointer to global memory that's been allocated in advance. I wanted to come up with some code that actually did something using shared memory, and that's why this function is a little more complicated than the examples you've seen. So, we're going to declare a few local variables that are private to each thread: a couple of indices. We'll declare the variable index just to be shorthand for threadIdx.x; that's just to save some typing. And we're going to declare a floating point variable called average, and another one called sum, and we'll initialize sum to 0.0. Here's the big example. Now we're going to declare a variable that lives in shared memory. In this case, it's an array. Again, I've hard coded that array to contain 128 elements. It's an array of floats, and I tag it as being in shared memory with this __shared__ declaration specifier. And remember, the whole point of shared memory is that this variable is visible to all the threads in the thread block, and it only has the lifetime of the thread block, so it doesn't exist outside of this thread block. Every thread block is going to have its own single copy of this array that all of the threads in that thread block can see. I call it sh_arr. The first thing we're going to do is put some data in this array. Remember, I passed in an array that's in global memory, called array, and I'm basically going to initialize the shared memory array to contain exactly what's in the global memory array. And notice how I do it: I'm copying data from this array in global memory to this array in shared memory, and I'm doing it with a single operation. How does this work? Every thread, if you look at this code, is figuring out what its own index is, which thread it is, and it's copying the element at array sub index into the element at shared array sub index. Okay, so since all of the threads in the thread block will be doing exactly this operation, and since they will all have different values for index, when this single line has completed in all the threads, we will have copied all of the information in this global memory array into our shared memory array. And the trick is that that operation is going to take some time. Multiple threads are running, and they're not all running at the same time, so it won't happen instantaneously. We have to make sure that all of these writes to shared memory have completed before we go on and do anything with that array in shared memory. And that's what the syncthreads operation is about. You've seen this before. So we call our barrier to make sure that all the writes to shared memory have completed, and after that barrier, we're guaranteed that this shared memory is fully populated. So every element has been initialized now. And just to be doing something, I said, well, let's just find the average of all the elements previous to this thread's index.
So what we're going to do, with this for loop, is set i equal to 0 and go up to index, which again is the number of this thread, and accumulate all the values in the shared memory array up to this thread's index. And after we're done, we'll compute the average, which is simply equal to the sum divided by the number of elements that we added up. And then, once again, we need to do something with that average. What I decided to do is to have every thread look at the average that it just computed of all of the elements in the array to its left, if you will. If the value in the array at this thread's index is greater than the average it just computed of all of the elements to the left of this thread's index, then we're going to set array at this thread's index equal to that average. In other words, we're going to replace any element that is greater than the average of all the elements to its left with that average. Notice that I'm operating on array, and not on the shared array. So I used this shared memory array to do my averaging, and that's a good idea, because remember that shared memory is fast, much faster to access than global memory. And since every thread is accessing a bunch of elements in the array, it makes sense to first move this array into shared memory, where accesses are faster. We'll talk about this later. But now I'm making this change back in the global memory array, and that means that when the kernel completes, this change is going to be seen by the host, and it would also potentially be seen by other threads in other thread blocks, if there were any. And then, just to make a point, here's a piece of code afterwards that has no effect at all: I'm going to set an element of the shared memory array to 3.14, but then the kernel completes. Nothing ever gets done with that value that's sitting in shared memory, and since the shared memory has the lifetime of the thread block, this memory evaporates as soon as this thread block completes. So this code has no effect and in fact can probably be removed by the compiler. Calling a kernel that uses shared memory is no different than calling a kernel that uses global memory, right? Since all you can do is pass in a local variable that points to global memory, if you've allocated it, then what that kernel does with its shared memory is completely up to it and not visible to the host code at all. So, this code showing how to use shared memory is called identically to the code we saw up here.
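Since the listing itself isn't reproduced in this transcript, here is a sketch reconstructed from the narration above; the variable names, the divide-by-zero guard for thread 0, and the omission of bounds checks are my choices, not necessarily the course's exact code.

__global__ void use_shared_memory_GPU(float *array)
{
    int i, index = threadIdx.x;          // shorthand for this thread's index
    float average, sum = 0.0f;

    __shared__ float sh_arr[128];        // one copy per thread block

    sh_arr[index] = array[index];        // each thread copies one element in
    __syncthreads();                     // barrier: sh_arr is now fully populated

    // Average the elements to the left of this thread's index, reading
    // from fast shared memory rather than global memory.
    for (i = 0; i < index; i++)
        sum += sh_arr[i];
    average = (index > 0) ? sum / (float)index : 0.0f;   // guard for thread 0

    // If this element exceeds the average of the elements to its left,
    // replace it with that average, writing back to global memory so the
    // result is visible to the host after the kernel completes.
    if (array[index] > average)
        array[index] = average;

    // This write has no effect: shared memory evaporates when the block
    // completes, so the compiler is free to eliminate it.
    sh_arr[index] = 3.14f;
}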
oversimplification. Right? It's quite possible that many of these values will be promoted into registers. An optimizing compiler might rearrange accesses and so forth. But, the point is simply to get across the relative speeds of memories.
into only ten elements means that after each thread has added one to its corresponding element in the array, we're going to end up with 10 elements, each containing the number 100,000. And then the code itself is simple. We have a timer class; again, I've sort of hidden that away so you don't have to deal with it right now. We're going to declare some host memory, we're going to declare some GPU memory, and we're going to zero out that memory. You haven't seen cudaMemset before, but it's exactly what you'd think: we're going to set all of the bytes of this device array to zero. Now, we're going to launch the kernel, and I've put a timer around this, because one of the things I want to show you is that atomic operations can slow things down. So, here's the kernel that we call, increment naive. We're going to launch it with a number of blocks equal to the total number of threads divided by the block width, and a number of threads per block equal to the block width. And remember, these numbers initially are 1,000,000 and 1,000, okay? So we're going to end up launching a thousand blocks, and we're going to pass in the device array, and then each thread is going to do its incrementing. And when it's all done, we're going to stop the timer and copy back the array using cudaMemcpy. Okay, so now we'll take that array that we just incremented and copy it back to the host, and then I hid away a little print array helper function; it just prints out the contents of the array. Then I'm going to print out the amount of time, in milliseconds, taken by this kernel, which I measured with the timer. Okay. So, that's the whole CUDA program. Let's compile and run it.
we get the same result; that time it took about 4 milliseconds. Closer to four again, down around three, and so forth.
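Here is a sketch of the two increment kernels in this experiment, reconstructed from the narration rather than copied from the course code; the modulo maps the one million threads onto the ten-element array described above.

// Naive version: plain read-modify-write. Many threads race on the same
// element, so the final counts come out wrong.
__global__ void increment_naive(int *g)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    i = i % 10;              // 1,000,000 threads land on only 10 elements
    g[i] = g[i] + 1;
}

// Atomic version: the hardware serializes the updates, so every element
// ends up at exactly 100,000, at some cost in speed.
__global__ void increment_atomic(int *g)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    i = i % 10;
    atomicAdd(&g[i], 1);
}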
Thread Divergence
You can also encounter thread divergence due to loops. Here's a somewhat contrived example. We have some pre-loop code in our kernel; all the threads are going to execute this code, and then they're going to reach this for loop. And the way I've constructed it, they're going to go through this loop a number of times determined by their thread index. So thread 0 will execute this code once, thread 1 will execute it twice, thread 2 will execute it three times, and so on. And then eventually they're all going to exit the loop and proceed to do some post-loop stuff. So what does this look like? Here's a bunch of threads, and they're all in the same thread block. I've color coded them so you can see what they do more easily. They're all going to execute this pre-loop code, and then they're going to reach the loop. Thread 0 is going to proceed through this loop code once and then just keep going. Thread 1 is going to execute the loop code, then execute it again, and keep going. Thread 2 will execute the loop code again and again and keep going. And thread 3 will execute the loop code 4 times. So let's think about these threads a little differently, in terms of what they're doing over time. The first thread executes the pre-loop code, then goes ahead and executes the loop code, and then it really just kind of sits around; it doesn't have anything to do for a while, because in the meantime, thread 1 has executed the pre-loop code, then the loop code, and then executes the loop code again. The third thread executes the pre-loop code, the loop code, the loop code, and then executes the loop code again. And the final thread executes the pre-loop code and then executes the loop code 4 times. And finally, all the threads can go ahead and proceed with the post-loop code. This diagram, when you draw it like this, kind of gives you a sense of why loop divergence is a bad thing, why it slows you down. It turns out that the hardware loves to run these threads together, and as long as they're doing the same thing, as long as they're executing the same code, it has the ability to do that. But in this case, the blue thread proceeds for a while and then, because it's not going to do the loop again, it just ends up waiting around while the other threads do so. And then the red thread waits for a little while, and the green thread waits a little bit. Only the purple thread was executing at full efficiency the whole time. And so you can imagine that if the hardware gets some efficiency out of running all four of these threads at the same time, then that efficiency has been lost during this portion of the loop.
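A minimal sketch of such a divergent loop is below; the kernel name and the body of the loop are placeholders of my own, and the point is only the loop bound that depends on the thread index.

__global__ void divergent_loop(float *out)
{
    int idx = threadIdx.x;
    float result = 0.0f;               // pre-loop code: all threads together

    // Thread 0 iterates once, thread 1 twice, thread 2 three times, ...
    // Threads in the same warp that finish early sit idle until the
    // slowest thread exits the loop.
    for (int i = 0; i <= idx; i++)
        result += 1.0f;

    out[idx] = result;                 // post-loop code: threads rejoin here
}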
Summary of Unit 2
So, let's wrap up and summarize some of the things that we've learned. We've learned about parallel communication patterns, going beyond the simple map operation that you saw in Unit 1 to encompass other important communication patterns like gather, scatter, stencil, and transpose. We've learned more about the GPU programming model and the underlying hardware, such as how thread blocks run on streaming multiprocessors, or SMs, what assumptions you can make about ordering, and how threads and thread blocks can synchronize to safely share data and memory. We've learned about the GPU memory model, topics like local, global, and shared memory, and how atomic operations can simplify memory writes by concurrent threads. Finally, we got a quick preview of strategies for efficient GPU programming. The first principle is to minimize the time spent on memory accesses by doing things like coalescing global memory accesses. We saw that the high-bandwidth global memory on the GPU operates best when adjacent threads access contiguous chunks of global memory, and this is called a coalesced memory access. We also learned to move frequently accessed data to faster memory, for example, by promoting data to shared memory. And we learned that atomic memory operations have a cost, but they're useful and you shouldn't necessarily freak out about that cost; often it's negligible, but it's something to be aware of. Along the same lines, we learned about avoiding the thread divergence that comes with branches and loops. Once again, thread divergence comes at a cost; you should be aware of it, but it isn't something you should be freaked out about. We're going to revisit these topics and talk much more about optimizing GPU programs in Unit 5. So, that's it.
Congratulations
Okay, congratulations on finishing unit 2. Let's recap what we've learned. You learned about how threads communicate with each other through memory and how they can access that memory efficiently when operating in concert. Along the way we've learned a few things about the GPU hardware, like its memory model and what assumptions we can and cannot make about when threads will run. Now you'll have a chance to put those concepts into practice: you'll implement a simple image blurring operation. Jen Hahn will tell you more.
Problem Set #2
In problem set number two, you will be implementing a parallel algorithm for blurring images, and here is an example of the effect we're talking about. So here's the original image, and here's the image after we apply a blur effect to that original image. Blurring an image involves averaging a local neighborhood of pixels, and it is expressed naturally using a parallel stencil operation. Stencil operations come up all the time, in all types of application domains, and this is why we are going to focus on stencil in this homework. So let's take a closer look at a simple example demonstrating the kind of local averaging we are talking about here. Suppose we have the following pixel representation of the image, and we want to calculate the average intensity value for this pixel right here. So what do we do? First, we take the value of this pixel, and we add this value to the values of all its neighbors to average them. And since we have 9 elements, or 9 pixels, here, we multiply the sum by one ninth, and that is how you would calculate the average intensity value for a pixel in an image. So if we do this operation for every pixel in the image, we would arrive at a blurred version of the input image. However, it turns out that performing an unweighted average of pixels can sometimes look really ugly, and we can achieve a better looking blur by computing a weighted average of these pixels. What I mean by weighted average is the following: rather than multiplying each pixel value here by one ninth, we multiply each pixel value by a different weight. So w1 is different than w2, and w2 may be different than w3, and w3 may be different than w4. And that is the approach that we will take in problem set number 2. Here is an image produced by a weighted blur, and here is an image produced by an unweighted blur, and as you can see, the weighted blur is much smoother than its unweighted counterpart. So in this problem set we'll give you a small 2D array that contains weight values between 0 and 1, as follows. But this is just an example; the actual weight values that we will use will look like this. The smooth shape of the weights, as you can see here, will produce the nice looking blur effect that we saw earlier. And also, here's a note: when we blur color images, the blurring is color-channel independent, and we'll
include a more detailed mathematical formula on the blurring computation in the instructor comments. So this is what you need to do for problem set number 2. First, you will need to write the actual blur kernel. Second, you will need to write the kernel that separates the color image into its R, G, B channels. Third, we'll give you the opportunity to allocate memory on the device for the filter, so you will have an opportunity to do your own cudaMalloc and cudaMemcpy calls. And fourth, you will have to set the correct, or the optimal, grid and block size for this problem set. As you remember from problem set number one, the grid and block size has a huge impact on your program's execution time, so set this size correctly and be careful. Lastly, your submission will be evaluated based on correctness and speed, but we recommend that you focus on correctness first. Then, after your blurring kernel runs correctly, we recommend that you try to make it run faster. And lastly, we have supplied serial code that you can reference and compare your solution against. Good luck on writing problem set number 2, and if you have any questions, feel free to ask in the classroom.
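For reference, here is a sketch of the weighted-average computation being described; the actual filter values and dimensions come with the assignment, and here r stands for an assumed filter half-width, with the weights taken to sum to 1:

\text{out}(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w(i, j)\, \text{in}(x + i,\, y + j), \qquad \sum_{i, j} w(i, j) = 1

In the unweighted case, w(i, j) = \frac{1}{(2r + 1)^2} for every (i, j), which for the 3 by 3 example above is the factor of one ninth, applied independently to each of the R, G, and B channels.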