Systolic Architecture
[Figure: memory (M) and processing elements (PE)]
Different from pipelining: nonlinear array structure, multidirectional data flow, each PE may have (small) local instruction and data memory
Different from SIMD: each PE may do something different
Initial motivation: VLSI Application-Specific Integrated Circuits (ASICs)
Represent algorithms directly by chips connected in a regular pattern
C = A × B
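For reference, the matrix product can be written element by element. This is the standard definition; the symbol c_{i,j} for an element of C and the 0-based index range (chosen to match a 3x3 case) are notation introduced here for illustration.

```latex
c_{i,j} = \sum_{k=0}^{2} a_{i,k}\, b_{k,j}, \qquad 0 \le i, j \le 2
```

Each term of the sum is one multiply-accumulate, which is the operation a single PE can perform per step.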
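Systolic Array Example: 3x3 Systolic Array Matrix Multiplication
• Processors arranged in a 2-D grid
• Each processor accumulates one element of the product
[Figure: 3x3 grid of PEs. Rows of A enter from the left and columns of B enter from the top.]
Alignments in time (columns of B; the bottom entry of each column enters the array first):
            b2,2
      b2,1  b1,2
b2,0  b1,1  b0,2
b1,0  b0,1
b0,0
T=2 (PE(i,j) denotes the PE in row i, column j)
PE(0,0): a0,0*b0,0 + a0,1*b1,0
PE(0,1): a0,0*b0,1
PE(1,0): a1,0*b0,0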
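The data movement in this example can be sketched as a small cycle-by-cycle simulation. This is a minimal sketch, not the slides' own code: the function name `systolic_matmul`, the register arrays, and the convention that operands shift first and are then multiplied are all assumptions chosen so that the partial sums match the T=2 snapshot above.

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array in which each PE accumulates one
    element of A x B. Returns the result and a snapshot of all accumulators
    after every step T = 1 .. 3n-2 (hypothetical helper for illustration)."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]        # one accumulator per PE
    a_reg = [[None] * n for _ in range(n)]   # A operand currently at each PE
    b_reg = [[None] * n for _ in range(n)]   # B operand currently at each PE
    snapshots = {}
    for t in range(1, 3 * n - 1):            # steps T = 1 .. 3n-2
        # Shift: A operands move one PE to the right, B operands one PE down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed inputs at the edges: row i of A is delayed by i steps,
        # column j of B is delayed by j steps.
        for i in range(n):
            k = t - 1 - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else None
        for j in range(n):
            k = t - 1 - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else None
        # Every PE that holds a valid operand pair does one multiply-accumulate.
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
        snapshots[t] = [row[:] for row in acc]
    return acc, snapshots
```

Calling `C, snapshots = systolic_matmul(A, B)` on two 3x3 lists of lists returns the full product in `C`. With nonzero inputs, `snapshots[2]` is nonzero only at PE(0,0), PE(0,1), and PE(1,0), as in the T=2 picture.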
T=3
PE(0,0): a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0
PE(0,1): a0,0*b0,1 + a0,1*b1,1
PE(0,2): a0,0*b0,2
PE(1,0): a1,0*b0,0 + a1,1*b1,0
PE(1,1): a1,0*b0,1
PE(2,0): a2,0*b0,0
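The partial sums at each time step follow a regular pattern: PE(i,j) appears to add its term with inner index k at step T = i + j + k + 1. The sketch below lists the products a PE has accumulated after a given step under that assumption. The function names `terms_at` and `show_state` are hypothetical, and the rule is read off the T=2 and T=3 snapshots rather than stated in the slides.

```python
def terms_at(t, i, j, n=3):
    """Products accumulated by PE(i, j) after step t (steps count from 1),
    assuming term k is added at step i + j + k + 1."""
    last_k = min(t - 1 - i - j, n - 1)   # newest term index, capped at n - 1
    return [f"a{i},{k}*b{k},{j}" for k in range(last_k + 1)]


def show_state(t, n=3):
    """One text line per PE that has accumulated at least one product."""
    lines = []
    for i in range(n):
        for j in range(n):
            terms = terms_at(t, i, j, n)
            if terms:
                lines.append(f"PE({i},{j}): " + " + ".join(terms))
    return lines
```

For example, `print("\n".join(show_state(3)))` lists the six PEs that are active by T=3 in the same form as the snapshot above.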
T=4
PE(0,0): a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0
PE(0,1): a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1
PE(0,2): a0,0*b0,2 + a0,1*b1,2
PE(1,0): a1,0*b0,0 + a1,1*b1,0 + a1,2*b2,0
PE(1,1): a1,0*b0,1 + a1,1*b1,1
PE(1,2): a1,0*b0,2
PE(2,0): a2,0*b0,0 + a2,1*b1,0
PE(2,1): a2,0*b0,1
T=5
PE(0,0): a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0
PE(0,1): a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1
PE(0,2): a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2
PE(1,0): a1,0*b0,0 + a1,1*b1,0 + a1,2*b2,0
PE(1,1): a1,0*b0,1 + a1,1*b1,1 + a1,2*b2,1
PE(1,2): a1,0*b0,2 + a1,1*b1,2
PE(2,0): a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0
PE(2,1): a2,0*b0,1 + a2,1*b1,1
PE(2,2): a2,0*b0,2
T=6
PE(0,0): a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0
PE(0,1): a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1
PE(0,2): a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2
PE(1,0): a1,0*b0,0 + a1,1*b1,0 + a1,2*b2,0
PE(1,1): a1,0*b0,1 + a1,1*b1,1 + a1,2*b2,1
PE(1,2): a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2
PE(2,0): a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0
PE(2,1): a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1
PE(2,2): a2,0*b0,2 + a2,1*b1,2
T=7
PE(0,0): a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0
PE(0,1): a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1
PE(0,2): a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2
PE(1,0): a1,0*b0,0 + a1,1*b1,0 + a1,2*b2,0
PE(1,1): a1,0*b0,1 + a1,1*b1,1 + a1,2*b2,1
PE(1,2): a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2
PE(2,0): a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0
PE(2,1): a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1
PE(2,2): a2,0*b0,2 + a2,1*b1,2 + a2,2*b2,2
Done
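The finishing time can be checked with a short calculation. It assumes, as the snapshots suggest, that PE(i,j) adds its term with inner index k at step i + j + k + 1. The symbols T_add, T_done, T_total and the general array size N are notation introduced here, and the extension beyond N = 3 is an extrapolation from this example.

```latex
T_{\text{add}}(i,j,k) = i + j + k + 1, \qquad
T_{\text{done}}(i,j) = i + j + N, \qquad
T_{\text{total}} = \max_{0 \le i,j \le N-1} T_{\text{done}}(i,j) = 3N - 2
```

For N = 3 this gives T_total = 7, which agrees with the last PE, PE(2,2), finishing at T=7.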