
PS3 programming basics

Week 1. SIMD programming on PPE
Materials are adapted from the textbook

Overview of the Cell Architecture

XIO: Rambus Extreme Data Rate (XDR) I/O memory channels

The PowerPC Processor Element


PPU
- A set of 64-bit registers
- 32 128-bit vector registers
- A 32-KB L1 I-cache
- A 32-KB L1 D-cache
- Two simultaneous threads of execution; the PPU can be viewed as a 2-way multiprocessor with shared dataflow
PPSS
- A unified 512-KB L2 I+D cache
- Various queues
- A bus interface unit

Synergistic Processor Elements


SPU
- 128 registers (each one 128 bits wide)
- 256-KB local store
- Has its own program counter and is optimized to run SPE threads spawned by the PPE
MFC
- DMA transfers to move instructions and data between the SPU's local store (LS) and main storage

Element Interconnect Bus

Threads and tasks


Term: PPE thread
Definition: A Linux thread running on a PPE.

Term: SPE thread
Definition: A Linux thread running on an SPE. Each such thread has its own SPE context, which includes the 128 x 128-bit register file, program counter, and MFC command queues, and can communicate with other execution units (or with effective-address memory through the MFC channel interface).

Term: Cell Broadband Engine task
Definition: A task running on the PPE and SPEs. Each such task has one or more Linux threads. All the threads within the task share the task's resources.

Vector/SIMD Extension unit


The 128-bit Vector/SIMD Multimedia Extension unit (VXU) operates concurrently with the PPU's fixed-point integer unit (FXU) and floating-point execution unit (FPU).

PPU SIMD PROGRAMMING BASICS

Vector intrinsic functions


Specific: have a 1-1 mapping with a single assembly-language instruction
EX: vec_abs(a)

Generic: map to one or more assembly-language instructions


EX: vec_or(a,b)

Predicates: compare values and return an integer that may be used directly for branching
EX: vec_all_eq(a,b), vec_any_eq(a,b)
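
A minimal, hedged illustration of the three classes in use (the helper name and argument values are mine, not from the text):

#include <altivec.h>

// Returns 1 when x and y are equal in every element; the vec_abs and vec_or
// calls are there only to show the call forms of the other two classes.
int intrinsic_classes_demo(vector signed int x, vector signed int y)
{
    vector signed int a = vec_abs(x);     // specific intrinsic (the slide's example)
    vector signed int o = vec_or(a, y);   // generic intrinsic
    (void)o;                              // result unused; shown only for the call form
    return vec_all_eq(x, y);              // predicate: returns an int, usable directly for branching
}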

Vector data types


The vector registers are 128 bits wide and can contain:

Sixteen 8-bit values, signed or unsigned
EX: vector unsigned char

Eight 16-bit values, signed or unsigned
EX: vector unsigned short

Four 32-bit values, signed or unsigned
EX: vector unsigned int

Four single-precision IEEE-754 floating-point values
EX: vector float
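
For reference, one declaration per layout (a hedged sketch; the variable names and initializer values are illustrative only):

#include <altivec.h>

vector unsigned char  vc = (vector unsigned char){1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};  // sixteen 8-bit elements
vector unsigned short vs = (vector unsigned short){1, 2, 3, 4, 5, 6, 7, 8};                                 // eight 16-bit elements
vector unsigned int   vi = (vector unsigned int){1, 2, 3, 4};                                               // four 32-bit elements
vector float          vf = (vector float){1.0f, 2.0f, 3.0f, 4.0f};                                          // four single-precision floats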

Big-endian byte and bit ordering

A general approach to get data


Typedef a union of an array of four ints and a vector of signed ints.

#include <stdio.h>
#include <altivec.h>   // needed for the vec_* intrinsics when compiling with GCC

// Define a type that can be an array of ints or a vector.
typedef union {
    int iVals[4];
    vector signed int myVec;
} vecVar;

How to use it?


int main() { vecVar v1, v2, vConst; // define variables
// load the literal value 2 into the 4 positions in vConst, vConst.myVec = (vector signed int){2, 2, 2, 2}; // load 4 values into the 4 element of vector v1 v1.myVec = (vector signed int){10, 20, 30, 40}; // call vector add function v2.myVec = vec_add( v1.myVec, vConst.myVec ); // see what we got! printf("\nResults:\nv2[0] = %d, v2[1] = %d, v2[2] = %d, v2[3] = %d\n\n", v2.iVals[0], v2.iVals[1], v2.iVals[2], v2.iVals[3]); return 0; }

__attribute__((aligned()))
Variables are aligned at a boundary corresponding to their datatype size
The datatype size of a vector is 16 (bytes)

When declaring a variable, you can set its alignment with __attribute__((aligned()))

EX: int var __attribute__((aligned(8)));
A valid address will look like 0x0FFFFFF8 or 0x0FFFFFF0
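
A minimal sketch of how this is typically used together with vector loads (the buffer name is illustrative):

// Force the buffer onto a 16-byte boundary so that vec_ld(0, buf) returns exactly these 16 bytes.
unsigned char buf[16] __attribute__((aligned(16)));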

Vector Add Operations


vector signed int VA, VB, VC;
VC = vec_add(VA, VB);

Example 1: array-summing
Traditional approach
// 16 iterations of a loop
int rolled_sum(unsigned char bytes[16])
{
    int i;
    int sum = 0;
    for (i = 0; i < 16; ++i) {
        sum += bytes[i];
    }
    return sum;
}

Vector Version (no loop)


// Vectorized for Vector/SIMD Multimedia Extension
int vectorized_sum(unsigned char data[16])
{
    vector unsigned char temp;
    union {
        int i[4];
        vector signed int v;
    } sum;
    vector unsigned int zero = (vector unsigned int){0};

    // Perform a misaligned vector load of the 16 bytes.
    temp = vec_perm(vec_ld(0, data), vec_ld(16, data), vec_lvsl(0, data));

    // Sum the 16 bytes of the vector.
    sum.v = vec_sums((vector signed int)vec_sum4s(temp, zero), (vector signed int)zero);

    // Extract the sum and return the result.
    return (sum.i[3]);
}

Function Description
Function              Explanation
d = vec_perm(a,b,c)   Vector Permute
d = vec_ld(a,b)       Vector Load Indexed
d = vec_lvsl(a,b)     Vector Load for Shift Left
d = vec_sums(a,b)     Vector Sum Saturated
d = vec_sum4s(a,b)    Vector Sum Across Partial (1/4) Saturated

d = vec_ld(a,b)
Loads 16 bytes from memory and returns them in d. a (an integer) is added to the address of b (a pointer), and the sum is truncated to a multiple of 16 bytes. The result is the contents of the 16 bytes of memory starting at this address.
If the address is not aligned on a 16-byte boundary, d is loaded from the next-lowest 16-byte boundary.

Example
Suppose data starts at address 4, so its 16 bytes occupy addresses 4 through 19.

d = vec_ld(0, data);
0 + 4 = 4 is truncated down to the 16-byte boundary 0, so d receives the bytes at addresses 0..15.

d = vec_ld(16, data);
16 + 4 = 20 is truncated down to 16, so d receives the bytes at addresses 16..31.

d = vec_lvsl(a,b)
Does not perform any load at all! It can be used to determine how the pointer is aligned relative to the 16-byte vector boundary: it returns the permute (shift-left) vector {n, n+1, ..., n+15}, where n is the low 4 bits of (a + the address of b).

Example: with data again starting at address 4,
d = vec_lvsl(4, data);
returns {8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23}, since (4 + 4) mod 16 = 8.

d = vec_perm(a,b,c)
Think of [a,b] as a 32-byte-long vector: the bytes of a have indices 0 to 15 and the bytes of b have indices 16 to 31. c is an index array; byte i of d is byte c[i] of [a,b].
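
A small, hedged illustration (the values are chosen only to make the byte selection visible):

#include <altivec.h>

vector unsigned char a = (vector unsigned char){ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15};
vector unsigned char b = (vector unsigned char){16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};
vector unsigned char c = (vector unsigned char){ 0, 16,  1, 17,  2, 18,  3, 19,  4, 20,  5, 21,  6, 22,  7, 23};
// d interleaves the low halves of a and b: {0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23}
vector unsigned char d = vec_perm(a, b, c);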

vec_sums, vec_sum2s, vec_sum4s

sum = vec_sums(I1,I2)
Adds all four elements of I1 plus element 3 of I2; the saturated result is placed in element 3 of sum, and the other elements are 0.

sum = vec_sum2s(I1,I2)
Sums across halves: element 1 of sum is I1[0] + I1[1] + I2[1], element 3 is I1[2] + I1[3] + I2[3]; elements 0 and 2 are 0.

sum = vec_sum4s(I1,I2)
Sums across quarters: each 32-bit element of sum is the sum of the corresponding group of elements of I1 (four bytes, or two halfwords) plus the corresponding element of I2.
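
A minimal, hedged example of where each function puts its result (the union trick from earlier is reused to read the elements back):

#include <stdio.h>
#include <altivec.h>

int main(void)
{
    union { vector signed int v; int i[4]; } out;
    union { vector unsigned int v; unsigned int u[4]; } out4;

    vector signed int    a     = (vector signed int){1, 2, 3, 4};
    vector signed int    zero  = (vector signed int){0, 0, 0, 0};
    vector unsigned char bytes = (vector unsigned char){1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};

    out.v = vec_sums(a, zero);                              // out.i[3] = 1+2+3+4 = 10
    printf("vec_sums : %d\n", out.i[3]);

    out.v = vec_sum2s(a, zero);                             // out.i[1] = 1+2 = 3, out.i[3] = 3+4 = 7
    printf("vec_sum2s: %d %d\n", out.i[1], out.i[3]);

    out4.v = vec_sum4s(bytes, (vector unsigned int){0});    // {10, 26, 42, 58}
    printf("vec_sum4s: %u %u %u %u\n", out4.u[0], out4.u[1], out4.u[2], out4.u[3]);
    return 0;
}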

Example 2: strcmp
int strcmp(const char* str1, const char* str2 );
Returns a positive value if str1 > str2, 0 if str1 == str2, and a negative value if str1 < str2.

#include <string.h>

int strcmp(const char *str1, const char *str2)
{
    int size1 = strlen(str1);
    int size2 = strlen(str2);
    int N = (size1 < size2) ? size1 : size2;   // min(size1, size2)

    for (int i = 0; i < N; i++) {
        if (str1[i] > str2[i])
            return 1;
        else if (str1[i] < str2[i])
            return -1;
    }
    if (size1 == size2)
        return 0;
    if (size1 > size2)
        return 1;
    return -1;
}

Vector Version
Let's assume that both str1 and str2 are aligned on 16-byte boundaries. Basic idea:
(1) Check the equality of two 16-byte vectors at a time.
(2) If they are not equal, check element by element.

Use vec_all_eq for (1)

vec_all_eq(a,b) returns 1 if all the elements of a and b are equal. Otherwise, it returns 0.
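
A minimal sketch of this idea under the slide's alignment assumption (the function name vec_strcmp is mine; it also assumes that reading each whole 16-byte block containing the strings is safe, and it falls back to a scalar loop as soon as a block differs or contains the terminating '\0'):

#include <string.h>
#include <altivec.h>

int vec_strcmp(const char *str1, const char *str2)
{
    int i = 0;
    for (;;) {
        // (1) compare 16 characters of each string at once
        vector unsigned char a = vec_ld(i, (const unsigned char *)str1);
        vector unsigned char b = vec_ld(i, (const unsigned char *)str2);

        if (vec_all_eq(a, b) && memchr(str1 + i, '\0', 16) == NULL) {
            i += 16;            // whole block equal and no terminator yet: keep going
            continue;
        }

        // (2) mismatch or end of string inside this block: finish byte by byte
        for (int j = i; ; ++j) {
            if (str1[j] != str2[j])
                return (unsigned char)str1[j] - (unsigned char)str2[j];
            if (str1[j] == '\0')
                return 0;
        }
    }
}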

Example 3: Insertion Sort


EX: sort an array num[] in ascending order
Insert num[i] into the sorted prefix num[0..i-1]

for (i = 1; i < N; i++)
    for (j = i; j > 0; j--)
        if (num[j-1] > num[j])
            swap(num[j-1], num[j]);
        else
            break;

Vector Version
Replace the scalar variable num[i] by a vector.
How to perform the swap function?
tmp = num[j-1]; num[j-1] = num[j]; num[j] = tmp;

Use vec_ld and vec_st
EX: vec = vec_ld(j*16, num); vec_st(vec, j*16, num);
What if num is not aligned on a 16-byte boundary?
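
A minimal sketch of the vector-level swap using vec_ld/vec_st, assuming num is a 16-byte-aligned array of signed ints (the helper name is mine):

#include <altivec.h>

// Swap the (j-1)-th and j-th 4-element blocks of num.
void swap_vec(signed int *num, int j)
{
    vector signed int a = vec_ld((j - 1) * 16, num);   // block j-1
    vector signed int b = vec_ld(j * 16, num);         // block j
    vec_st(b, (j - 1) * 16, num);
    vec_st(a, j * 16, num);
}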

How about the comparison?


Can vec_all_gt work?

Two stages
1. Order the vectors so that all larger elements end up in one vector and all smaller elements in the other. (Inter-vector sorting)
EX: turn
{25, 23, 21, 16} {20, 15, 21, 18}
into
{25, 21, 23, 21} {20, 15, 18, 16}

What is the sequential code to do that?

2. Order the elements inside the individual vectors. (Intra-vector sorting)

Inter-vector Sort
Two functions: vec_min and vec_max
Each returns a vector containing the min (or max) of the two inputs in each position.
EX: vec_max({25,23,21,16}, {20,15,21,18}) = {25,23,21,18}
EX: vec_min({25,23,21,16}, {20,15,21,18}) = {20,15,21,16}

This is almost what we need, except that each element is only compared against the element in the same position of the other vector, so we also need to rotate one of the vectors between steps.

Rotate a Vector

Start with {25, 23, 21, 16} and {20, 15, 21, 18}.
vec_max gives {25, 23, 21, 18}; vec_min gives {20, 15, 21, 16}.
Rotate the vec_min result by one element, then take vec_max/vec_min again and rotate again. Repeating this step moves every larger element into one vector and every smaller element into the other.

We can use vec_perm to rotate a vector. The index vector is {4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3}.
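
A minimal sketch of the whole inter-vector step, assuming (as the slide's picture suggests) that repeating max/min/rotate four times, once per element position, is enough to separate the four largest values from the four smallest (the function and variable names are mine):

#include <altivec.h>

// After the call, *big holds the four largest of the eight input values
// (in some order) and *small holds the four smallest.
void inter_vector_sort(vector signed int *big, vector signed int *small)
{
    const vector unsigned char rot =
        (vector unsigned char){4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3};
    vector signed int a = *big, b = *small;

    for (int i = 0; i < 4; i++) {
        vector signed int hi = vec_max(a, b);   // larger value of each pair
        vector signed int lo = vec_min(a, b);   // smaller value of each pair
        a = hi;
        b = vec_perm(lo, lo, rot);              // rotate by one 4-byte element
    }
    *big = a;
    *small = b;
}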

Intra-vector Sort
Rely on four functions
d = vec_cmpgt(a,b): compares the elements of a and b; if a[i] > b[i], d[i] is all ones (0xFFFFFFFF), otherwise d[i] = 0, for i = 0, 1, 2, 3.
d = vec_and(a,b): d[i] = a[i] & b[i] (bit-level AND)
d = vec_add(a,b): d[i] = a[i] + b[i]
d = vec_perm(a,b,c): we have already learned it.

How to do that?
For example, sort {12,7,-5,9}

Some Analysis
How many comparisons do we need?
(0,1),(0,2),(0,3),(1,2),(1,3),(2,3)

Which can be compared (sorted) in parallel?


For example: {(0,1), (2,3)}, {(0,2), (1,3)},{(0,3), (1,2)}

What can we get if {(0,1), (2,3)} is sorted first?


We get A[0] ≤ A[1] and A[2] ≤ A[3]. What's next?

What can we get after {(0,2), (1,3)} is sorted?


A[0] ≤ A[1], A[2] ≤ A[3] (why?), and A[0] ≤ A[2], A[1] ≤ A[3].

What do we miss?

Sorting Network
Step 1: {(0,1), (2,3)}
Step 2: {(0,2), (1,3)}
Step 3: {(1,2)}
Exercise: what is the sorting network if we sort {(0,3), (1,2)} first? And {(0,2), (1,3)} first?

How do we implement a comparison such as {(0,1), (2,3)}?
Need to compare elements using vec_cmpgt
Need to exchange data according to the result

EX: Compare {(0,1),(2,3)}


b = vec_perm(a, a, (vector unsigned char){4,5,6,7, 0,1,2,3, 12,13,14,15, 8,9,10,11});
// b[0] = a[1], b[1] = a[0], b[2] = a[3], b[3] = a[2]
d = vec_cmpgt(a, b);
// For a = {12, 7, -5, 9}: b = {7, 12, 9, -5}, so d = {0xFFFFFFFF, 0, 0, 0xFFFFFFFF}

Exercise: what is the index array if we want to compare {(0,2),(1,3)} or {(1,2)}?


For {(0,2),(1,3)}: {8,9,10,11, 12,13,14,15, 0,1,2,3, 4,5,6,7}
For {(1,2)}: {0,1,2,3, 8,9,10,11, 4,5,6,7, 12,13,14,15}

Vector Comparison Functions


We need to exchange data if d[0] or d[2] is all ones. The only way to exchange data is by vec_perm.
How to design the index array c?
If d == {0, 1s, 0, 1s}, c = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15}
If d == {1s, 0, 0, 1s}, c = {4,5,6,7, 0,1,2,3, 8,9,10,11, 12,13,14,15}
If d == {0, 1s, 1s, 0}, c = {0,1,2,3, 4,5,6,7, 12,13,14,15, 8,9,10,11}
If d == {1s, 0, 1s, 0}, c = {4,5,6,7, 0,1,2,3, 12,13,14,15, 8,9,10,11}
(Here 1s denotes an all-ones element, 0xFFFFFFFF.)

One possible way to generate c is c = base + mask (byte-wise, with vec_add), where

base can be {0,1,2,3, 0,1,2,3, 8,9,10,11, 8,9,10,11}
mask can be vec_and(d, {4,4,4,4, 4,4,4,4, 4,4,4,4, 4,4,4,4})
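
A minimal sketch of one compare-exchange stage built from these pieces (the helper name compare_exchange and the casts are mine; the constants shown are the ones given above for the {(0,1),(2,3)} stage):

#include <altivec.h>

// One stage of the sorting network: compare the element pairs selected by
// pair_idx and exchange the elements that are out of order.
vector signed int compare_exchange(vector signed int a,
                                   vector unsigned char pair_idx,
                                   vector unsigned char base,
                                   vector unsigned char mask_amt)
{
    vector signed int    b = vec_perm(a, a, pair_idx);              // partner of each element
    vector bool int      d = vec_cmpgt(a, b);                       // all ones where a[i] > its partner
    vector unsigned char m = vec_and((vector unsigned char)d, mask_amt);
    vector unsigned char c = vec_add(base, m);                      // c = base + mask
    return vec_perm(a, a, c);                                       // perform the exchanges
}

// Stage {(0,1),(2,3)}:
//   pair_idx = {4,5,6,7, 0,1,2,3, 12,13,14,15, 8,9,10,11}
//   base     = {0,1,2,3, 0,1,2,3, 8,9,10,11, 8,9,10,11}
//   mask_amt = {4,4,4,4, 4,4,4,4, 4,4,4,4, 4,4,4,4}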

Exercises
How to design the index array for {(0,2)(1,3)}?
base = {0,1,2,3, 4,5,6,7, 0,1,2,3, 4,5,6,7}
mask = {8,8,8,8, 8,8,8,8, 8,8,8,8, 8,8,8,8}

How to design the index array for {(0,3)(1,2)}?


base = {0,1,2,3, 4,5,6,7, 4,5,6,7, 0,1,2,3}
mask = {12,12,12,12, 4,4,4,4, 4,4,4,4, 12,12,12,12}

How to design the index array for {(1,2)}?


base = {0,1,2,3, 4,5,6,7, 4,5,6,7, 12,13,14,15}
mask = {0,0,0,0, 4,4,4,4, 4,4,4,4, 0,0,0,0}
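
Putting the three stages together, a hedged end-to-end sketch (the function name sort4 is mine) that reuses the compare_exchange helper sketched earlier with the constants from these slides; for a = {12, 7, -5, 9} the result is {-5, 7, 9, 12}:

// Sort the four elements of a into ascending order with the 3-stage network.
vector signed int sort4(vector signed int a)
{
    // Stage 1: {(0,1),(2,3)}
    a = compare_exchange(a,
        (vector unsigned char){4,5,6,7, 0,1,2,3, 12,13,14,15, 8,9,10,11},
        (vector unsigned char){0,1,2,3, 0,1,2,3, 8,9,10,11, 8,9,10,11},
        (vector unsigned char){4,4,4,4, 4,4,4,4, 4,4,4,4, 4,4,4,4});
    // Stage 2: {(0,2),(1,3)}
    a = compare_exchange(a,
        (vector unsigned char){8,9,10,11, 12,13,14,15, 0,1,2,3, 4,5,6,7},
        (vector unsigned char){0,1,2,3, 4,5,6,7, 0,1,2,3, 4,5,6,7},
        (vector unsigned char){8,8,8,8, 8,8,8,8, 8,8,8,8, 8,8,8,8});
    // Stage 3: {(1,2)}
    a = compare_exchange(a,
        (vector unsigned char){0,1,2,3, 8,9,10,11, 4,5,6,7, 12,13,14,15},
        (vector unsigned char){0,1,2,3, 4,5,6,7, 4,5,6,7, 12,13,14,15},
        (vector unsigned char){0,0,0,0, 4,4,4,4, 4,4,4,4, 0,0,0,0});
    return a;   // elements now in ascending order
}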

Homework
Read textbook chapter 9. Implement "quick sort" or "merge sort":
Implement the sequential code.
Use vectorized statements.
Compare the performance of the different implementations, and compare them against the insertion sort in the textbook.
