08_dataparallel
Data-Parallel Thinking
Parallel Computing
Stanford CS149, Fall 2023
Today’s theme
▪ Many of you are now likely accustomed to thinking about parallel programming in terms of
“what workers do”
▪ Today I would like you to think about describing algorithms in terms of operations on sequences
of data
- map
- filter
- fold / reduce
- scan / segmented scan
- sort
- groupBy
- join
- partition / flatten
[Figure: modern GPU with a 6 MB L2 cache and 900 GB/sec of memory bandwidth.]
This chip can concurrently execute up to 163,860 CUDA threads! (Programs that do not expose significant amounts of parallelism, and don't have high arithmetic intensity, will not run efficiently on GPUs!)
▪ Important: unlike arrays, programs can only access elements of a sequence through
specific operations
▪ In C++:

template<class InputIt, class OutputIt, class UnaryOperation>
OutputIt transform(InputIt first1, InputIt last1, OutputIt d_first,
                   UnaryOperation unary_op);

Example: map f, where f adds 10 to each element:

int f(int x) { return x + 10; }

input:  3  8  4  6  3  9  2  8
output: 13 18 14 16 13 19 12 18
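A concrete sketch using the C++ standard library (std::transform is exactly map over an iterator range); the program below reproduces the example above:

#include <algorithm>
#include <cstdio>
#include <vector>

int f(int x) { return x + 10; }

int main() {
    std::vector<int> in = {3, 8, 4, 6, 3, 9, 2, 8};
    std::vector<int> out(in.size());
    // map: out[i] = f(in[i]), with no dependence between elements
    std::transform(in.begin(), in.end(), out.begin(), f);
    for (int v : out) printf("%d ", v);   // prints: 13 18 14 16 13 19 12 18
    return 0;
}

Because each element is processed independently, an implementation is free to process all N elements in parallel.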
Fold

fold :: b -> ((b,a) -> b) -> seq a -> b

Example: fold 10 (+) over the sequence 3 8 4 6 3 9 2 8 produces 53; the running totals are:

10 13 21 25 31 34 43 45 53

A parallel fold computes partial results over chunks of the sequence (each chunk starting from an identity value id) and merges them with a combiner function comb.

* No need for comb if f :: (b,b) -> b is an associative binary operator
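A minimal sketch with the standard library: std::accumulate is the sequential left fold, while std::reduce (C++17) may regroup and reorder the combination, which is why it requires the associativity noted above:

#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> in = {3, 8, 4, 6, 3, 9, 2, 8};
    // Sequential fold with initial value 10: (((10+3)+8)+4)+... = 53
    int total = std::accumulate(in.begin(), in.end(), 10);
    // Parallelizable fold: correct here because + is associative
    int total2 = std::reduce(in.begin(), in.end(), 10);
    printf("%d %d\n", total, total2);   // prints: 53 53
    return 0;
}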
Scan
f :: (a,a) -> a (associative binary op)
scan :: a -> ((a,a) -> a) -> seq a -> seq a
Example: scan_inclusive (+):

input:  3  8  4  6  3  9  2  8
output: 3 11 15 21 24 33 35 43
Alternative form: “scan exclusive”: out[i] is the scan result for all elements up to, but excluding, in[i].
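Both forms exist in C++17 as std::inclusive_scan and std::exclusive_scan; a small sketch reproducing the example above:

#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> in = {3, 8, 4, 6, 3, 9, 2, 8};
    std::vector<int> inc(in.size()), exc(in.size());
    // Inclusive: out[i] covers elements up to and including in[i]
    std::inclusive_scan(in.begin(), in.end(), inc.begin());
    // Exclusive: out[i] covers elements strictly before in[i]
    std::exclusive_scan(in.begin(), in.end(), exc.begin(), 0);
    for (int v : inc) printf("%d ", v);   // prints: 3 11 15 21 24 33 35 43
    printf("\n");
    for (int v : exc) printf("%d ", v);   // prints: 0 3 11 15 21 24 33 35
    return 0;
}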
Naive parallel inclusive scan (notation: ai-j denotes ai ⊕ … ⊕ aj). Each step, every element combines with the element at twice the previous distance to its left:

after step 1: a0 a0-1 a1-2 a2-3 a3-4 a4-5 a5-6 a6-7 a7-8 a8-9 a9-10 a10-11 a11-12 a12-13 a13-14 a14-15
after step 2: a0 a0-1 a0-2 a0-3 a1-4 a2-5 a3-6 a4-7 a5-8 a6-9 a7-10 a8-11 a9-12 a10-13 a11-14 a12-15
after step 3: a0 a0-1 a0-2 a0-3 a0-4 a0-5 a0-6 a0-7 a1-8 a2-9 a3-10 a4-11 a5-12 a6-13 a7-14 a8-15
after step 4: a0 a0-1 a0-2 a0-3 a0-4 a0-5 a0-6 a0-7 a0-8 a0-9 a0-10 a0-11 a0-12 a0-13 a0-14 a0-15
Work-efficient exclusive scan algorithm
(with ⊕ = “+”)

Up-sweep:
for d = 0 to (log2 n) - 1 do
    forall k = 0 to n-1 by 2^(d+1) do
        a[k + 2^(d+1) - 1] = a[k + 2^d - 1] + a[k + 2^(d+1) - 1]

Down-sweep:
a[n-1] = 0
for d = (log2 n) - 1 down to 0 do
    forall k = 0 to n-1 by 2^(d+1) do
        tmp = a[k + 2^d - 1]
        a[k + 2^d - 1] = a[k + 2^(d+1) - 1]
        a[k + 2^(d+1) - 1] = tmp + a[k + 2^(d+1) - 1]
Up-sweep intermediate states (ai-j denotes ai + … + aj):
after d=0: a0 a0-1 a2 a2-3 a4 a4-5 a6 a6-7 a8 a8-9 a10 a10-11 a12 a12-13 a14 a14-15
after d=1: a0 a0-1 a2 a0-3 a4 a4-5 a6 a4-7 a8 a8-9 a10 a8-11 a12 a12-13 a14 a12-15
after d=2: a0 a0-1 a2 a0-3 a4 a4-5 a6 a0-7 a8 a8-9 a10 a8-11 a12 a12-13 a14 a8-15
Down-sweep, after the final up-sweep step and setting a[n-1] = 0:
a0 a0-1 a2 a0-3 a4 a4-5 a6 a0-7 a8 a8-9 a10 a8-11 a12 a12-13 a14 0
after d=3: a0 a0-1 a2 a0-3 a4 a4-5 a6 0 a8 a8-9 a10 a8-11 a12 a12-13 a14 a0-7
after d=2: a0 a0-1 a2 0 a4 a4-5 a6 a0-3 a8 a8-9 a10 a0-7 a12 a12-13 a14 a0-11
after d=1: a0 0 a2 a0-1 a4 a0-3 a6 a0-5 a8 a0-7 a10 a0-9 a12 a0-11 a14 a0-13
after d=0: 0 a0 a0-1 a0-2 a0-3 a0-4 a0-5 a0-6 a0-7 a0-8 a0-9 a0-10 a0-11 a0-12 a0-13 a0-14
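A minimal sequential C++ sketch of the up-sweep/down-sweep algorithm above, assuming n is a power of two; the iterations of each inner loop are independent, which is where a parallel implementation gets its parallelism:

#include <cstdio>
#include <vector>

// In-place work-efficient exclusive scan (n must be a power of two)
void exclusive_scan(std::vector<int>& a) {
    int n = (int)a.size();
    // Up-sweep: build a reduction tree of partial sums (d plays the role of 2^d)
    for (int d = 1; d < n; d *= 2)
        for (int k = 0; k < n; k += 2 * d)     // independent iterations
            a[k + 2*d - 1] += a[k + d - 1];
    // Down-sweep: push prefix sums back down the tree
    a[n - 1] = 0;
    for (int d = n / 2; d >= 1; d /= 2)
        for (int k = 0; k < n; k += 2 * d) {   // independent iterations
            int tmp = a[k + d - 1];
            a[k + d - 1] = a[k + 2*d - 1];
            a[k + 2*d - 1] += tmp;
        }
}

int main() {
    std::vector<int> a = {3, 8, 4, 6, 3, 9, 2, 8};
    exclusive_scan(a);
    for (int v : a) printf("%d ", v);   // prints: 0 3 11 15 21 24 33 35
    return 0;
}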
Scan: two-processor (shared memory) implementation

input: a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15

Phase 1 (in parallel): P1 sequentially scans the first half (a0 through a7); P2 sequentially scans the second half (a8 through a15).
Phase 2 (in parallel): let base = a0-7. P1 adds base to elements a8 thru a8-11; P2 adds base to elements a8-12 thru a8-15.
Scan: SIMD implementation (naive formulation within each warp)

Work: N lg(N)

Work-efficient formulation of scan is not beneficial in this context because it results in low SIMD utilization. The work-efficient algorithm would require more than 2x the number of instructions as the implementation above!

Scanning a 128-element array on a GPU:
warp 0: length-32 SIMD scan of a0-31
warp 1: length-32 SIMD scan of a32-63
warp 2: length-32 SIMD scan of a64-95
warp 3: length-32 SIMD scan of a96-127
[Figure: multi-block CUDA scan implemented as multiple kernel launches: each thread block scans its portion of the array (e.g., Block 0 scan), and a later launch (Launch 2) combines per-block results. Exceeding 1 million elements requires partitioning phase two into multiple blocks.]
Scan implementation
▪ Parallelism
- Scan algorithm features O(N) parallel work
- But efficient implementations use only as much parallelism as is needed to achieve good utilization of the machine
- Goal is to reduce work and reduce communication/synchronization
▪ Locality
- Multi-level implementation to match memory hierarchy
(CUDA example: per-block implementation carried out in local memory)
Segmented scan

let A = [[1,2],[6],[1,2,3,4]]
let ⊕ = +
segmented_scan_exclusive(⊕, A) = [[0,1], [0], [0,1,3,6]]
Segmented scan down-sweep (the data array is accompanied by a flag array in which 1 marks the start of a segment):

data[n-1] = 0
for d = (log2 n) - 1 down to 0 do:
    forall k = 0 to n-1 by 2^(d+1) do:
        tmp = data[k + 2^d - 1]
        data[k + 2^d - 1] = data[k + 2^(d+1) - 1]
        if flag_original[k + 2^d] == 1:   # must maintain copy of original flags
            data[k + 2^(d+1) - 1] = 0     # start of segment
        else if flag[k + 2^d - 1] == 1:
            data[k + 2^(d+1) - 1] = tmp
        else:
            data[k + 2^(d+1) - 1] = tmp + data[k + 2^(d+1) - 1]
        flag[k + 2^d - 1] = 0
Down-sweep intermediate states (data rows only; a row of segment-start flags accompanies each step):

a0 a0-1 a2 a0-3 a4 a5 a6 a5-7 a8 a8-9 a10 a10-11 a12 a12-13 a14 a10-15
a0 a0-1 a2 a0-3 a4 a5 a6 a5-7 a8 a8-9 a10 a10-11 a12 a12-13 a14 0
a0 a0-1 a2 a0-3 a4 a5 a6 0 a8 a8-9 a10 a10-11 a12 a12-13 a14 0
a0 a0-1 a2 0 a4 a5 a6 a0-3 a8 a8-9 a10 0 a12 a12-13 a14 a10-11
a0 0 a2 a0-1 a4 a0-3 a6 a5 a8 0 a10 0 a12 a10-11 a14 a10-13
▪ Segmented scan
- Express computation on irregular data structures (e.g., a list of lists) in a regular, data-parallel way (a sequential reference sketch follows below)
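A minimal sequential sketch of the semantics (not the parallel down-sweep above), representing A as a flat data array plus a flags array where flag[i] == 1 marks the start of a segment; all names are illustrative:

#include <cstdio>
#include <vector>

// Segmented exclusive scan: restart the running sum at each segment head
std::vector<int> seg_scan_exclusive(const std::vector<int>& data,
                                    const std::vector<int>& flag) {
    std::vector<int> out(data.size());
    int sum = 0;
    for (size_t i = 0; i < data.size(); i++) {
        if (flag[i] == 1) sum = 0;   // start of a new segment
        out[i] = sum;
        sum += data[i];
    }
    return out;
}

int main() {
    // A = [[1,2],[6],[1,2,3,4]] flattened, with segment-start flags
    std::vector<int> data = {1, 2, 6, 1, 2, 3, 4};
    std::vector<int> flag = {1, 0, 1, 1, 0, 0, 0};
    for (int v : seg_scan_exclusive(data, flag)) printf("%d ", v);
    // prints: 0 1 0 0 1 3 6
    return 0;
}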
Gather

output_seq = gather(index_seq, data_seq)   “Gather data from data_seq according to indices in index_seq.”

Index sequence:  3 12 4 9 9 15 13 0
Data sequence:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Output sequence: 3 12 4 9 9 15 13 0

Scatter

output_seq = scatter(index_seq, data_seq)   “Scatter data from data_seq according to indices in index_seq.”

Data sequence:   0 1 2 3 4 5 6 7
Index sequence:  3 0 4 7 1 12 9 14
Output sequence: 16 slots (positions 0 through 15); data_seq[i] is written to output_seq[index_seq[i]]
Gather machine instruction
gather(R1, R0, mem_base); “Gather from buffer mem_base into R1 according to indices specified by R0.”
[Figure: gather from a 16-entry buffer at mem_base holding values 0-15; with R0 = (3, 12, 4, 9, 9, 15, 13, 0), the instruction loads R1 = (3, 12, 4, 9, 9, 15, 13, 0).]
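Both primitives reduce to a single indexed loop; a plain-C++ sketch of the semantics (a real machine executes these as vector instructions):

#include <vector>

// Gather: indexed reads, output[i] = data[index[i]]
void gather(const std::vector<int>& index, const std::vector<int>& data,
            std::vector<int>& output) {
    for (size_t i = 0; i < index.size(); i++)
        output[i] = data[index[i]];
}

// Scatter: indexed writes, output[index[i]] = data[i]. Unlike gather,
// duplicate indices create a write conflict, so the result can depend
// on the order in which elements are processed.
void scatter(const std::vector<int>& index, const std::vector<int>& data,
             std::vector<int>& output) {
    for (size_t i = 0; i < index.size(); i++)
        output[index[i]] = data[i];
}

int main() {
    std::vector<int> data(16), out(8);
    for (int i = 0; i < 16; i++) data[i] = i;   // mem_base contents above
    std::vector<int> index = {3, 12, 4, 9, 9, 15, 13, 0};
    gather(index, data, out);   // out = 3 12 4 9 9 15 13 0
    return 0;
}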
▪ Filter
- Remove elements from the sequence that do not match a predicate

filter f: 3 8 4 6 3 9 2 8 → 8 4 6 2 8
(assume f() filters out elements whose value is odd)
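A sketch of filter using std::copy_if; the lambda keeps even values, matching the example above:

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <vector>

int main() {
    std::vector<int> in = {3, 8, 4, 6, 3, 9, 2, 8};
    std::vector<int> out;
    // Keep only the elements satisfying the predicate (here: even values)
    std::copy_if(in.begin(), in.end(), std::back_inserter(out),
                 [](int x) { return x % 2 == 0; });
    for (int v : out) printf("%d ", v);   // prints: 8 4 6 2 8
    return 0;
}

In a data-parallel setting, filter is commonly built from the earlier primitives: map the predicate over the input, exclusive-scan the 0/1 results to compute output addresses, then scatter the surviving elements.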
▪ Sort
Example: grouping particles by grid cell

[Figure: six particles (0-5) in a 4x4 uniform grid of cells numbered 0-15. Particles 3 and 5 lie in cell 4; particles 1, 2, and 4 lie in cell 6; particle 0 lies in cell 9.]

Step 1: compute the grid cell containing each particle (parallel over particles)
Step 2: sort results by cell (notice that the particle index array is permuted based on the sort)

particle_index: 3 5 1 2 4 0
grid_cell:      4 4 6 6 6 9
Step 3: find start/end of each cell (parallel over particle_index elements)

this_cell = grid_cell[thread_index];
if (thread_index == 0) // special case for first cell
    cell_starts[this_cell] = 0;
else if (this_cell != grid_cell[thread_index-1]) {
    cell_starts[this_cell] = thread_index;
    cell_ends[grid_cell[thread_index-1]] = thread_index;
}
if (thread_index == numParticles-1) // special case for last cell
    cell_ends[this_cell] = thread_index+1;

This code is run for each element of the particle_index array (each invocation has a unique value of ‘thread_index’).

This solution maintains a large amount of parallelism and removes the need for fine-grained synchronization… at the cost of a sort and extra passes over the data (extra BW)!
Result (cells indexed 0-15): cell_starts[4] = 0, cell_starts[6] = 2, cell_starts[9] = 5; cell_ends[4] = 2, cell_ends[6] = 5, cell_ends[9] = 6 (end indices are not inclusive).
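Putting the steps together in a sequential C++ sketch (std::stable_sort stands in for a parallel sort; the step 3 loop body mirrors the per-thread kernel code above, and its iterations are independent; the cell assignments reproduce the example figure):

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Step 1 result: grid_cell_of[p] = cell containing particle p
    std::vector<int> grid_cell_of = {9, 6, 6, 4, 6, 4};
    int numParticles = (int)grid_cell_of.size();

    // Step 2: sort particle indices by cell id
    std::vector<int> particle_index(numParticles);
    for (int i = 0; i < numParticles; i++) particle_index[i] = i;
    std::stable_sort(particle_index.begin(), particle_index.end(),
        [&](int a, int b) { return grid_cell_of[a] < grid_cell_of[b]; });
    std::vector<int> grid_cell(numParticles);
    for (int i = 0; i < numParticles; i++)
        grid_cell[i] = grid_cell_of[particle_index[i]];
    // particle_index is now 3 5 1 2 4 0; grid_cell is 4 4 6 6 6 9

    // Step 3: find the start/end of each cell's run of particles
    std::vector<int> cell_starts(16, -1), cell_ends(16, -1);
    for (int t = 0; t < numParticles; t++) {
        int this_cell = grid_cell[t];
        if (t == 0)
            cell_starts[this_cell] = 0;
        else if (this_cell != grid_cell[t - 1]) {
            cell_starts[this_cell] = t;
            cell_ends[grid_cell[t - 1]] = t;
        }
        if (t == numParticles - 1)
            cell_ends[this_cell] = t + 1;
    }
    printf("cell 4: [%d,%d)  cell 6: [%d,%d)  cell 9: [%d,%d)\n",
           cell_starts[4], cell_ends[4],
           cell_starts[6], cell_ends[6],
           cell_starts[9], cell_ends[9]);   // [0,2) [2,5) [5,6)
    return 0;
}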
Another example: parallel histogram
▪ Consider computing a histogram for a sequence of values
int f(float value); // maps array values to histogram bin ids
float input[N];
int histogram_bins[NUM_BINS]; // assume bins are initialized to 0

// temporary buffers
int bin_ids[N]; // bin_ids[i] = id of bin that element i goes in
int sorted_bin_idx[N];
int bin_starts[NUM_BINS]; // initialized to -1
if (bin_starts[thread_index] == -1) {
    histogram_bins[thread_index] = 0; // no items in this bin
} else {
    // Tricky edge case: if the next bin is empty, then must search
    // forward to find the next non-empty bin
    int next_idx = thread_index+1;
    while (next_idx < num_bins && bin_starts[next_idx] == -1)
        next_idx++;
    // bin size = distance from this bin's start to the next non-empty
    // bin's start (or to N if no later bin is non-empty)
    int next_start = (next_idx < num_bins) ? bin_starts[next_idx] : N;
    histogram_bins[thread_index] = next_start - bin_starts[thread_index];
}
Assume variable thread_index is the “thread index” associated with the invocation of the kernel function
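End to end, the sort-based histogram looks like the following sequential C++ sketch; the particular bin function f (value mod NUM_BINS) is an illustrative assumption, not from the slides:

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int NUM_BINS = 4;
    const int N = 8;
    // Illustrative bin function (assumption): bin id = (int)value % NUM_BINS
    auto f = [&](float value) { return (int)value % NUM_BINS; };

    std::vector<float> input = {3, 8, 4, 6, 3, 9, 2, 8};

    // Step 1 (map): compute each element's bin id
    std::vector<int> bin_ids(N);
    for (int i = 0; i < N; i++) bin_ids[i] = f(input[i]);

    // Step 2 (sort): group equal bin ids into contiguous runs
    std::vector<int> sorted_bin_idx = bin_ids;
    std::sort(sorted_bin_idx.begin(), sorted_bin_idx.end());

    // Step 3: record where each bin's run starts (-1 = empty bin)
    std::vector<int> bin_starts(NUM_BINS, -1);
    for (int i = 0; i < N; i++)
        if (i == 0 || sorted_bin_idx[i] != sorted_bin_idx[i - 1])
            bin_starts[sorted_bin_idx[i]] = i;

    // Step 4: bin count = distance to the next non-empty bin's start
    std::vector<int> histogram_bins(NUM_BINS);
    for (int b = 0; b < NUM_BINS; b++) {
        if (bin_starts[b] == -1) { histogram_bins[b] = 0; continue; }
        int next_idx = b + 1;
        while (next_idx < NUM_BINS && bin_starts[next_idx] == -1)
            next_idx++;
        int next_start = (next_idx < NUM_BINS) ? bin_starts[next_idx] : N;
        histogram_bins[b] = next_start - bin_starts[b];
    }
    for (int b = 0; b < NUM_BINS; b++)
        printf("bin %d: %d\n", b, histogram_bins[b]);   // 3, 1, 2, 2
    return 0;
}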
Summary
▪ Data parallel thinking:
- Implementing algorithms in terms of simple (often widely parallelizable, efficiently
implemented) operations on large data collections