PS3 Programming Basics: Week 1. SIMD Programming on PPE. Materials are adapted from the textbook.
XIO: Rambus Extreme Data Rate (XDR) I/O (XIO) memory channels
Predicates: compare values and return an integer that may be used directly for branching
EX: vec_all_eq(a,b), vec_any_eq(a,b)
__attribute__((aligned(n)))
A variable is aligned on a boundary corresponding to the size of its datatype
The datatype size of a vector is 16 bytes
Example 1: array-summing
Traditional approach
// 16 iterations of a loop
int rolled_sum(unsigned char bytes[16])
{
    int i;
    int sum = 0;
    for (i = 0; i < 16; ++i) {
        sum += bytes[i];
    }
    return sum;
}
Function Description
Function               Explanation
d = vec_perm(a,b,c)    Vector Permute
d = vec_ld(a,b)        Vector Load Indexed
d = vec_lvsl(a,b)      Vector Load for Shift Left
d = vec_sums(a,b)      Vector Sum Saturated
d = vec_sum4s(a,b)     Vector Sum Across Partial (1/4) Saturated
d = vec_ld(a,b)
Loads 16 bytes from memory and returns them in d.
a (an integer) is added to the address in b (a pointer), and the sum is truncated to a multiple of 16 bytes. The result is the contents of the 16 bytes of memory starting at this address.
If the address is not aligned on a 16-byte boundary, d is loaded from the next-lowest 16-byte boundary.
Example (each byte of memory holds its own offset; data points to offset 4)
d = vec_ld(0, data);
bytes at data: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
d:             0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
d = vec_ld(16, data);
bytes at data: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
d:             16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
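The address truncation described above can be sketched in portable C. This is only an emulation for illustration, not the AltiVec intrinsic: emu_vec_ld, mem, and b_off are hypothetical names, with mem standing in for memory that starts on a 16-byte boundary and b_off for the offset of the pointer b inside it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical emulation of the address arithmetic of vec_ld(a, b).
   mem models memory starting on a 16-byte boundary; b_off is the
   offset of pointer b inside mem (both are assumptions of this sketch). */
static void emu_vec_ld(unsigned char d[16], int a,
                       const unsigned char *mem, size_t b_off)
{
    /* Add a to the address, then clear the low 4 bits so the
       effective address is a multiple of 16. */
    size_t ea = (b_off + (size_t)a) & ~(size_t)15;
    memcpy(d, mem + ea, 16);
}
```

With mem[i] = i and b_off = 4, a = 0 yields bytes 0 to 15 and a = 16 yields bytes 16 to 31, matching the example on the slide.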
d = vec_lvsl(a,b)
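Does not perform any loading at all! It can be used to determine whether the pointer is aligned relative to the 16-byte vector boundary.
The result holds the byte values sh, sh+1, ..., sh+15, where sh is the low 4 bits of the sum a+b.
Example (data points to offset 4 past a 16-byte boundary)
d = vec_lvsl(4, data)
bytes at data: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
d:             8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23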
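A minimal sketch of what vec_lvsl computes, written as plain C so it runs anywhere. The helper name emu_vec_lvsl and the idea of passing the pointer as an integer b_addr are assumptions made for this illustration.

```c
#include <stdint.h>

/* Hypothetical emulation of vec_lvsl(a, b): no memory is read.
   Only the low 4 bits of (a + address of b) matter. */
static void emu_vec_lvsl(unsigned char d[16], int a, uintptr_t b_addr)
{
    unsigned sh = (unsigned)((b_addr + (uintptr_t)a) & 15u);
    for (int i = 0; i < 16; ++i)
        d[i] = (unsigned char)(sh + i);   /* sh, sh+1, ..., sh+15 */
}
```

If the first element of the result is 0, the address a + b lies on a 16-byte boundary; any other value is the misalignment in bytes.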
d = vec_perm(a,b,c)
Think of [a,b] as one 32-byte vector: the bytes of a have indices 0 to 15 and the bytes of b have indices 16 to 31. c is an index array: d[i] is the byte of [a,b] at index c[i].
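The indexing rule can be sketched in portable C as below. emu_vec_perm is a hypothetical stand-in, not the intrinsic itself; it follows the AltiVec rule that only the low 5 bits of each index byte are used.

```c
/* Hypothetical emulation of d = vec_perm(a, b, c).
   [a,b] is treated as one 32-byte array; c selects bytes from it. */
static void emu_vec_perm(unsigned char d[16],
                         const unsigned char a[16],
                         const unsigned char b[16],
                         const unsigned char c[16])
{
    for (int i = 0; i < 16; ++i) {
        unsigned idx = c[i] & 31u;               /* low 5 bits only */
        d[i] = (idx < 16) ? a[idx] : b[idx - 16];
    }
}
```

One common use, assuming the index vector comes from vec_lvsl: if a and b are two consecutive aligned 16-byte blocks and c is {4,5,...,19}, the result is the 16 bytes that start 4 bytes into a, which is how a misaligned vector can be assembled from two aligned loads.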
[Figure: operands I1 and I2 and the result sum]
sum = vec_sum4s(I1,I2)
[Figure: operands I1 and I2 and the result sum]
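To connect these functions back to Example 1, here is a sketch of the array sum done in two steps, using plain-C emulations of vec_sum4s and vec_sums. The names emu_vec_sum4s, emu_vec_sums, and vector_style_sum are hypothetical, and the combination shown is one plausible way to use the two functions rather than the textbook's exact code.

```c
#include <stdint.h>

/* Emulates vec_sum4s for unsigned bytes: each 32-bit lane i becomes
   b[i] plus the sum of bytes a[4i..4i+3], with unsigned saturation. */
static void emu_vec_sum4s(uint32_t d[4], const unsigned char a[16],
                          const uint32_t b[4])
{
    for (int i = 0; i < 4; ++i) {
        uint64_t s = b[i];
        for (int j = 0; j < 4; ++j)
            s += a[4 * i + j];
        d[i] = (s > UINT32_MAX) ? UINT32_MAX : (uint32_t)s;
    }
}

/* Emulates vec_sums: the last lane becomes a[0]+a[1]+a[2]+a[3]+b[3]
   with signed saturation; the other lanes are zero. */
static void emu_vec_sums(int32_t d[4], const int32_t a[4],
                         const int32_t b[4])
{
    int64_t s = b[3];
    for (int i = 0; i < 4; ++i)
        s += a[i];
    if (s > INT32_MAX) s = INT32_MAX;
    if (s < INT32_MIN) s = INT32_MIN;
    d[0] = d[1] = d[2] = 0;
    d[3] = (int32_t)s;
}

/* Sums 16 bytes in two steps instead of a 16-iteration loop. */
int vector_style_sum(const unsigned char bytes[16])
{
    const uint32_t zero_u[4] = {0, 0, 0, 0};
    const int32_t zero_s[4] = {0, 0, 0, 0};
    uint32_t quarter[4];
    int32_t quarter_s[4], total[4];

    emu_vec_sum4s(quarter, bytes, zero_u);      /* four partial sums */
    for (int i = 0; i < 4; ++i)
        quarter_s[i] = (int32_t)quarter[i];     /* at most 1020 each */
    emu_vec_sums(total, quarter_s, zero_s);     /* sum across lanes */
    return total[3];
}
```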
Example 2: strcmp
int strcmp(const char* str1, const char* str2 );
Returns a positive value if str1>str2, 0 if str1==str2, and a negative value if str1<str2
#include <string.h>   /* strlen */
int strcmp(const char *str1, const char *str2)
{
    int size1 = strlen(str1);
    int size2 = strlen(str2);
    int N = (size1 < size2) ? size1 : size2;   /* min(size1, size2) */
    for (int i = 0; i < N; i++) {
        if (str1[i] > str2[i]) return 1;
        else if (str1[i] < str2[i]) return -1;
    }
    if (size1 == size2) return 0;
    if (size1 > size2) return 1;
    return -1;
}
Vector Version
Let's assume that both str1 and str2 are aligned on 16-byte boundaries. Basic idea:
(1) Check whether the two 16-byte vectors are equal. (2) If they are not, check element by element.
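The two-step idea can be sketched in portable C as follows. Here memcmp over 16 bytes stands in for a whole-vector equality test such as vec_all_eq, and chunked_strcmp is a hypothetical name. The sketch assumes both buffers are readable in whole 16-byte chunks up to and including the chunk that holds the terminator, which is what the alignment assumption gives on the PPE. Unlike the scalar version above, it compares bytes as unsigned char.

```c
#include <string.h>

/* Hypothetical chunked comparison: 16 bytes at a time.
   Assumes both buffers are padded to a multiple of 16 bytes. */
int chunked_strcmp(const char *s1, const char *s2)
{
    for (;;) {
        if (memcmp(s1, s2, 16) == 0) {            /* (1) vectors equal? */
            if (memchr(s1, '\0', 16) != NULL)
                return 0;                         /* terminator reached */
            s1 += 16;
            s2 += 16;
            continue;
        }
        for (int i = 0; i < 16; ++i) {            /* (2) element by element */
            unsigned char c1 = (unsigned char)s1[i];
            unsigned char c2 = (unsigned char)s2[i];
            if (c1 != c2)
                return (c1 > c2) ? 1 : -1;
            if (c1 == '\0')
                return 0;   /* strings ended; only padding differs */
        }
    }
}
```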
Vector Version
Replace the scalar variable num(i) by a vector. How to perform the swap function? tmp=num(j-1);num(j-1)=num(j);num(j)=tmp;
Use vec_ld and vec_st. EX: vec = vec_ld(j*16, num); vec_st(vec, j*16, num). What if num is not aligned on a 16-byte boundary?
Two stages
1. Order the vectors, such that all larger elements are in one vector and all smaller elements are in another. (Inter-vector sorting)
EX: turn
{25,23,21,16} {20,15,21,18}
into
{25,21,23,21} {20,15,18,16}
2. Sort the elements within each vector. (Intra-vector sorting)
Inter-vector Sort
Two functions: vec_min and vec_max
Each returns a vector containing the min (or max) of the two elements in each position.
EX: vec_max({25,23,21,16},{20,15,21,18}) = {25,23,21,18}
EX: vec_min({25,23,21,16},{20,15,21,18}) = {20,15,21,16}
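For readers without PowerPC hardware, the element-wise behavior can be checked with a small plain-C emulation. emu_vec_max and emu_vec_min are hypothetical names used only in this sketch.

```c
/* Hypothetical scalar emulation of vec_max and vec_min on four ints. */
static void emu_vec_max(int d[4], const int a[4], const int b[4])
{
    for (int i = 0; i < 4; ++i)
        d[i] = (a[i] > b[i]) ? a[i] : b[i];
}

static void emu_vec_min(int d[4], const int a[4], const int b[4])
{
    for (int i = 0; i < 4; ++i)
        d[i] = (a[i] < b[i]) ? a[i] : b[i];
}
```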
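Rotate a Vector
We can use vec_perm to rotate a vector by one element. The index vector is {4,5,6,7,8,9,10,11,12,13,14,15,0,1,2,3}
Apply vec_max and vec_min, rotate the min vector, and repeat until every rotation has been compared:
start:             {25,23,21,16} {20,15,21,18}
vec_max, vec_min:  {25,23,21,18} {20,15,21,16}
rotate min:        {25,23,21,18} {15,21,16,20}
vec_max, vec_min:  {25,23,21,20} {15,21,16,18}
rotate min:        {25,23,21,20} {21,16,18,15}
vec_max, vec_min:  {25,23,21,20} {21,16,18,15}
rotate min:        {25,23,21,20} {16,18,15,21}
vec_max, vec_min:  {25,23,21,21} {16,18,15,20}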
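The max/min/rotate sequence can be sketched in portable C as below. This is an element-level emulation under stated assumptions: inter_vector_sort is a hypothetical name, and the rotation is done on whole ints, which corresponds to applying vec_perm with the byte index vector {4,...,15,0,1,2,3}.

```c
/* Hypothetical emulation of the inter-vector sort on two 4-int vectors.
   On return, hi holds the four larger values and lo the four smaller
   ones; neither vector is sorted internally yet. */
void inter_vector_sort(int hi[4], int lo[4])
{
    for (int step = 0; step < 4; ++step) {
        /* vec_max / vec_min: keep the larger value of each position
           in hi and the smaller one in lo. */
        for (int i = 0; i < 4; ++i) {
            if (lo[i] > hi[i]) {
                int tmp = hi[i];
                hi[i] = lo[i];
                lo[i] = tmp;
            }
        }
        /* Rotate lo left by one element so each of its values meets
           a different position of hi on the next step. */
        if (step < 3) {
            int first = lo[0];
            lo[0] = lo[1];
            lo[1] = lo[2];
            lo[2] = lo[3];
            lo[3] = first;
        }
    }
}
```

Starting from {25,23,21,16} and {20,15,21,18}, this ends with {25,23,21,21} in hi and {16,18,15,20} in lo.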
Intra-vector Sort
Rely on four functions
d = vec_cmpgt(a,b): compares the elements of a and b. If a[i]>b[i], d[i]=0xFFFFFFFF (all bits set). Otherwise, d[i]=0, for i=0,1,2,3.
d = vec_and(a,b): d[i] = a[i]&b[i] (bit-level AND)
How to do that?
For example, sort {12,7,-5,9}
Some Analysis
How many comparisons do we need?
(0,1),(0,2),(0,3),(1,2),(1,3),(2,3)
What do we miss?
Sorting Network
Step 1: {(0,1)(2,3)}
Step 2: {(0,2)(1,3)}
Step 3: {(1,2)}
Exercise: what's the sorting network if we sort {(0,3), (1,2)} first? And {(0,2), (1,3)} first?
How do we make the comparisons in each step?
Need to compare elements using vec_cmpgt
Need to exchange data according to the result
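The three-step network can be tried out in plain C. The sketch below is scalar, not vectorized: compare_exchange and sort4 are hypothetical names, and the mask-based exchange only imitates the style of vec_cmpgt (an all-ones or all-zeros mask) and vec_and (masking the difference). It is one possible way to exchange data according to the comparison result, not necessarily the textbook's.

```c
#include <stdint.h>

/* Puts the smaller of v[i], v[j] at i and the larger at j without
   branching on the data after the mask is formed. */
static void compare_exchange(int32_t v[4], int i, int j)
{
    uint32_t gt = (v[i] > v[j]) ? 0xFFFFFFFFu : 0u;          /* cmpgt-style mask */
    uint32_t diff = ((uint32_t)v[i] ^ (uint32_t)v[j]) & gt;  /* and-style masking */
    v[i] = (int32_t)((uint32_t)v[i] ^ diff);   /* swapped only when gt is set */
    v[j] = (int32_t)((uint32_t)v[j] ^ diff);
}

/* Sorting network for four elements, ascending order. */
void sort4(int32_t v[4])
{
    compare_exchange(v, 0, 1);   /* Step 1: {(0,1)(2,3)} */
    compare_exchange(v, 2, 3);
    compare_exchange(v, 0, 2);   /* Step 2: {(0,2)(1,3)} */
    compare_exchange(v, 1, 3);
    compare_exchange(v, 1, 2);   /* Step 3: {(1,2)} */
}
```

Applied to {12,7,-5,9}, the steps give {7,12,-5,9}, then {-5,9,7,12}, then {-5,7,9,12}.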
Exercises
How to design the index array for {(0,2)(1,3)}?
base={0,1,2,3,4,5,6,7, 0,1,2,3,4,5,6,7}
mask={8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}
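One way base and mask might be used for the step {(0,2)(1,3)} is sketched below in portable C. This is an interpretation, not something the slides state: it assumes base is the byte sequence 0 to 7 repeated twice, that the permute index is formed as base + (cmp & mask), and that cmp compares each element with the element two positions away. exchange_02_13 is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical emulation of one network step, {(0,2)(1,3)}, via a byte
   index array. Positions 0,1 receive the minima and positions 2,3 the
   maxima of the pairs (0,2) and (1,3). */
void exchange_02_13(int32_t v[4])
{
    /* Assumed index arrays: base selects elements 0,1,0,1; adding 8
       bytes moves a selection two elements further (to 2 or 3). */
    static const unsigned char base[16] =
        {0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7};
    static const unsigned char mask[16] =
        {8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
    unsigned char bytes[16], idx[16], out[16];

    memcpy(bytes, v, 16);
    for (int e = 0; e < 4; ++e) {
        int partner = (e + 2) % 4;
        /* cmpgt-style result: all ones if v[e] > v[partner] */
        unsigned char cmp = (v[e] > v[partner]) ? 0xFF : 0x00;
        for (int k = 0; k < 4; ++k) {
            int p = 4 * e + k;
            idx[p] = (unsigned char)(base[p] + (cmp & mask[p]));
        }
    }
    for (int p = 0; p < 16; ++p)      /* perm-style byte gather */
        out[p] = bytes[idx[p] & 15];
    memcpy(v, out, 16);
}
```

Because whole 4-byte groups are moved, the sketch does not depend on byte order. For {7,12,-5,9} it produces {-5,9,7,12}.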
Homework
Read textbook chap 9. Implement "quick sort" or "merge sort"
Implement the sequential code.
Use vectorized statements.
Compare the performance of the different implementations with each other and with the insertion sort in the textbook.