CS6461 - Computer Architecture Fall 2016 - Vector Operations
Adapted from Professor Stephen H. Kaisler's slides
Initialize I = 0
20 Read B(I)
Read C(I)
Store A(I) = B(I) + C(I)
Increment I = I + 1
If I <= 100 Go to 20
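The scalar loop above can be contrasted with a single whole-array operation. The following is a minimal Python sketch, not taken from the slides: `scalar_add` mirrors the pseudocode's control flow (as written, I runs from 0 through 100), while `vector_add` stands in for one conceptual vector instruction. Both function names are illustrative assumptions.

```python
def scalar_add(b, c):
    """Element-at-a-time loop that mirrors the pseudocode's control flow."""
    a = [0.0] * len(b)
    last = len(b) - 1
    i = 0                        # Initialize I = 0
    while True:
        a[i] = b[i] + c[i]       # Read B(I), read C(I), store A(I)
        i += 1                   # Increment I = I + 1
        if not i <= last:        # If I <= last index, go to 20
            break
    return a


def vector_add(b, c):
    """Whole-array form: stands in for one vector instruction, A = B + C."""
    return [x + y for x, y in zip(b, c)]


# Indices 0..100 inclusive, matching the loop bounds in the pseudocode.
b = [float(i) for i in range(101)]
c = [2.0 * i for i in range(101)]
assert scalar_add(b, c) == vector_add(b, c)
```

The scalar version spends most of its instructions on loop overhead (index update, compare, branch), which is the cost a vector instruction removes.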
[Figure: pipelined floating-point operation on vectors B and C. Stages: multiply mantissas, add exponents, normalize result, put sign. Operands B(1), C(1), ... enter the pipeline; results A(1) ... A(N) emerge.]
Vector-Register Processors:
All vector operations (except load and store) occur in the
vector registers.
Vector counterpart of a load-store architecture
Used by all major vector computers (Cray machines, NEC SX/2 through
SX/5, Fujitsu VP200, etc.)
Memory-Memory Processors:
All vector operations are memory to memory.
Examples: CDC vector computers (Cyber 203, Cyber 205) and the TI ASC
All are obsolete!
[Figure: vector processor block diagram. Labels: memory, mask unit, registers, MASK, vector pipelines.]
Basic Vector-Register Processor Architecture
[Figure: block diagram with main memory, vector load-store unit, vector registers, and functional units (FP add/subtract, FP multiply, FP divide, integer, logical).]
A scalar processor
Scalar register file
Scalar functional units (arithmetic, load/store, etc)
A vector register file (a 2D register array)
Each register is an array of elements, e.g. 32 registers with 32 64-bit
elements per register
MVL = maximum vector length = max # of elements per register
A set of pipelined vector functional units: Integer, FP, load/store, etc
Sometimes vector and scalar units are combined (share ALUs)
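The components listed above can be modelled in a few lines. This is a minimal Python sketch under stated assumptions: the class `VectorRegisterFile`, the vector-length field `vl`, and the helper `add_arrays` are hypothetical names, and processing a long array in MVL-sized pieces (often called strip mining) is an addition for illustration rather than something the slide describes.

```python
NUM_VREGS = 32   # number of vector registers (example figure from the text)
MVL = 32         # maximum vector length: elements per register


class VectorRegisterFile:
    """A 2D register array: NUM_VREGS registers of MVL elements each."""

    def __init__(self):
        self.v = [[0.0] * MVL for _ in range(NUM_VREGS)]
        self.vl = MVL  # hypothetical vector-length register

    def set_vl(self, n):
        if not 0 <= n <= MVL:
            raise ValueError("vector length must be between 0 and MVL")
        self.vl = n

    def load(self, vd, memory, base):
        """Unit-stride load of vl elements into register vd."""
        for i in range(self.vl):
            self.v[vd][i] = memory[base + i]

    def store(self, vs, memory, base):
        """Unit-stride store of vl elements from register vs."""
        for i in range(self.vl):
            memory[base + i] = self.v[vs][i]

    def addv(self, vd, vs1, vs2):
        """Register-to-register add: only load and store touch memory."""
        for i in range(self.vl):
            self.v[vd][i] = self.v[vs1][i] + self.v[vs2][i]


def add_arrays(rf, memory, a, b, c, n):
    """memory[a+i] = memory[b+i] + memory[c+i] for i < n, MVL at a time."""
    done = 0
    while done < n:
        rf.set_vl(min(MVL, n - done))   # last piece may be shorter than MVL
        rf.load(1, memory, b + done)
        rf.load(2, memory, c + done)
        rf.addv(0, 1, 2)
        rf.store(0, memory, a + done)
        done += rf.vl


# B occupies words 0..99, C words 100..199, A words 200..299.
memory = [float(i) for i in range(100)] + [1.0] * 100 + [0.0] * 100
rf = VectorRegisterFile()
add_arrays(rf, memory, a=200, b=0, c=100, n=70)
```

With n = 70 and MVL = 32, the loop runs three times with vector lengths 32, 32, and 6.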
Three types of addressing
Unit stride
Contiguous block of information in memory
Fastest: always possible to optimize this
Non-unit (constant) stride
Harder to optimize memory system for all possible strides
Prime number of data banks makes it easier to support different strides at full
bandwidth
Indexed (gather-scatter)
Vector equivalent of register indirect
Good for sparse arrays of data
Increases number of programs that vectorize
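The three addressing types differ only in how the sequence of element addresses is generated. The sketch below is illustrative Python, not from the slides; it also shows why a prime number of banks helps with strided access. It assumes word-interleaved banks where the bank number is the address modulo the bank count, and all function names are hypothetical.

```python
def unit_stride_addresses(base, n):
    """Contiguous block: base, base+1, ..."""
    return [base + i for i in range(n)]


def strided_addresses(base, stride, n):
    """Constant stride: base, base+stride, base+2*stride, ..."""
    return [base + i * stride for i in range(n)]


def indexed_addresses(base, index_vector):
    """Gather-scatter: each address comes from an index vector."""
    return [base + k for k in index_vector]


def banks_touched(addresses, num_banks):
    """Number of distinct banks hit, assuming bank = address mod num_banks."""
    return len({a % num_banks for a in addresses})


stride8 = strided_addresses(0, 8, 16)

# With 8 banks, a stride of 8 sends every access to the same bank.
assert banks_touched(stride8, 8) == 1
# With 7 banks (a prime), the same stride spreads over all 7 banks.
assert banks_touched(stride8, 7) == 7
# Unit stride uses every bank in either configuration.
assert banks_touched(unit_stride_addresses(0, 16), 8) == 8
```

A stride conflicts with the banks only when it shares a factor with the bank count, which is why a prime bank count supports most strides at full bandwidth.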
      do 10 i = 1,100
        do 10 j = 1,100
          A(i,j) = 0.0
          do 10 k = 1,100
10          A(i,j) = A(i,j)+B(i,k)*C(k,j)
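The matrix-multiply loop above shows where non-unit strides come from. Assuming Fortran's column-major storage (which the loop syntax suggests but the slide does not state), the inner loop over k walks C(k,j) with unit stride and B(i,k) with a stride equal to the array dimension. The Python helpers below are hypothetical and only compute word offsets.

```python
def offset(i, j, n):
    """Column-major word offset of the 1-based element (i, j) of an n x n array."""
    return (i - 1) + (j - 1) * n


def inner_loop_strides(n):
    """Strides seen by the k loop for B(i,k) and C(k,j)."""
    i, j = 1, 1
    b_stride = offset(i, 2, n) - offset(i, 1, n)   # B(i,k) -> B(i,k+1)
    c_stride = offset(2, j, n) - offset(1, j, n)   # C(k,j) -> C(k+1,j)
    return b_stride, c_stride


# For the 100 x 100 arrays in the loop: B has stride 100, C has stride 1.
assert inner_loop_strides(100) == (100, 1)
```

Under a row-major layout the roles would swap, so one of the two operands has a non-unit stride either way.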
Suppose:
MULV V1,V2,V3
ADDV V4,V1,V5
Chaining: if the vector register (V1) is treated not as a single
entity but as a group of individual registers, then pipeline
forwarding can work on individual elements of a vector
Flexible chaining: allow a vector operation to chain to any other
active vector operation, i.e. pass the result of one vector
operation directly to another; this requires more read/write ports
As long as there is enough hardware, chaining increases convoy size
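The benefit of chaining for the MULV/ADDV pair can be estimated with a simple timing model. The Python sketch below is an illustration under stated assumptions: each pipeline accepts one element per cycle, the start-up latencies (7 cycles for multiply, 6 for add) and the vector length of 64 are assumed example values rather than figures from the slides, and the function name is hypothetical.

```python
def last_result_cycle(n, mul_latency, add_latency, chained):
    """Cycle in which the final ADDV result is produced for n elements."""
    # Element i of V1 leaves the multiply pipeline in cycle mul_latency + i + 1.
    mul_done = [mul_latency + i + 1 for i in range(n)]
    if chained:
        # Each V1 element is forwarded to the adder as soon as it is produced.
        add_done = [t + add_latency for t in mul_done]
    else:
        # ADDV waits until the whole of V1 has been written.
        start = mul_done[-1]
        add_done = [start + add_latency + i + 1 for i in range(n)]
    return add_done[-1]


N, MUL_LAT, ADD_LAT = 64, 7, 6   # assumed example values

unchained = last_result_cycle(N, MUL_LAT, ADD_LAT, chained=False)
chained = last_result_cycle(N, MUL_LAT, ADD_LAT, chained=True)

assert unchained == (MUL_LAT + N) + (ADD_LAT + N)   # 141 cycles
assert chained == MUL_LAT + ADD_LAT + N             # 77 cycles
```

In this model the unchained pair pays the vector length twice, while the chained pair pays it once plus both start-up latencies.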
Vector Register Bypassing
Suppose:
do 100 i = 1,n
100 A(K(i)) = A(K(i)) + C(M(i))
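The sparse update above maps naturally onto indexed (gather-scatter) addressing: gather A(K(i)) and C(M(i)) into vectors, add them, and scatter the sums back through K. The Python sketch below is illustrative only; it uses 0-based indices instead of Fortran's 1-based ones, and the function names are hypothetical.

```python
def sparse_update_scalar(a, k, c, m):
    """Reference loop: A(K(i)) = A(K(i)) + C(M(i)), one element at a time."""
    for i in range(len(k)):
        a[k[i]] = a[k[i]] + c[m[i]]


def sparse_update_vector(a, k, c, m):
    """Same update expressed as gather, vector add, scatter."""
    va = [a[idx] for idx in k]                # gather A(K(i))
    vc = [c[idx] for idx in m]                # gather C(M(i))
    vsum = [x + y for x, y in zip(va, vc)]    # vector add
    for idx, val in zip(k, vsum):             # scatter back through K
        a[idx] = val


# With distinct indices in K, both forms give the same result.
a1 = [10.0, 20.0, 30.0, 40.0]
a2 = list(a1)
k, m, c = [3, 0, 2], [1, 1, 0], [0.5, 1.5]
sparse_update_scalar(a1, k, c, m)
sparse_update_vector(a2, k, c, m)
assert a1 == a2 == [11.5, 20.0, 30.5, 41.5]
```

If K contains a repeated index, the two forms can differ: the scalar loop accumulates both contributions, while the gathered version reads the old value twice and the last scatter wins. A vectorizing compiler therefore needs to know, or be told, that the indices in K are distinct.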
Sony PlayStation 3
Partnership between Sony, Toshiba, and IBM
PowerPC-based main core (PPE)
Multiple SPEs
On-die memory controller
Inter-core transport bus
High-speed I/O
Clocked at 3-4 GHz
256 GFLOPS single precision at 4 GHz
Offloads a large amount of work onto the compiler and software