hpc_architecture
Victor Eijkhout
Fall 2023
Operation timing:
n operations
ℓ number of stages
τ clock cycle
⇒ t(n) = nℓτ
With pipelining:
t(n) = [s + ℓ + n − 1]τ, where s is a setup cost
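To make the two timing formulas concrete, here is a minimal C sketch that evaluates both in units of the clock cycle τ. The function names are made up for illustration, and s is read as a setup cost in cycles.

```c
#include <stdio.h>

/* Cycles for n operations of ell stages each, executed one after
   another without pipelining: t(n) = n * ell (in units of tau). */
long unpipelined_cycles(long n, long ell) {
    return n * ell;
}

/* Cycles with pipelining: t(n) = s + ell + n - 1 (in units of tau),
   where s is the setup cost in cycles. */
long pipelined_cycles(long n, long ell, long s) {
    return s + ell + n - 1;
}

/* Hypothetical helper: print both timings side by side. */
void print_pipeline_timing(long n, long ell, long s) {
    printf("n=%ld: unpipelined %ld cycles, pipelined %ld cycles\n",
           n, unpipelined_cycles(n, ell), pipelined_cycles(n, ell, s));
}
```

With the assumed values n = 100, ℓ = 4, s = 0 this gives 400 cycles against 103; for large n the pipelined cost per operation approaches one cycle.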
for (i) {
x[i+1] = a[i]*x[i] + b[i];
}
Transform:
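The transformed loop itself is not preserved here. One common way to do it, shown as a sketch rather than as the slide's own version, is to substitute the recurrence into itself so that x[i+2] depends on x[i] directly. The function names are hypothetical, and x is assumed to hold n+1 entries with x[0] given.

```c
#include <stddef.h>

/* Original recurrence: each x[i+1] needs the x[i] just computed,
   so successive iterations cannot overlap in the pipeline. */
void recurrence_plain(double *x, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i + 1] = a[i] * x[i] + b[i];
}

/* Transformed recurrence. Substituting x[i+1] = a[i]*x[i] + b[i] gives
     x[i+2] = (a[i+1]*a[i]) * x[i] + (a[i+1]*b[i] + b[i+1]).
   The two bracketed coefficients do not depend on x, so they can be
   computed in a pipelined way; the dependence distance becomes 2. */
void recurrence_two_step(double *x, const double *a, const double *b, size_t n) {
    size_t i;
    /* even-indexed entries, each computed from the entry two places back */
    for (i = 0; i + 2 <= n; i += 2)
        x[i + 2] = (a[i + 1] * a[i]) * x[i] + (a[i + 1] * b[i] + b[i + 1]);
    /* odd-indexed entries: each uses an already known even entry,
       and these updates are independent of each other */
    for (i = 0; i < n; i += 2)
        x[i + 1] = a[i] * x[i] + b[i];
}
```

The transformed version does more arithmetic in total; the gain is that more of it is independent work. In floating point the two versions can differ by rounding.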
Addition/multiplication: pipelined
Division (and square root): much slower
for ( i )
a[i] = b[i] / c
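Since division is much slower than multiplication, a loop like this is often rewritten to divide once and multiply afterwards. A minimal sketch with hypothetical function names:

```c
#include <stddef.h>

/* Straightforward version: one (slow) division per element. */
void scale_by_division(double *a, const double *b, double c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] / c;
}

/* Rewritten version: one division up front, then only multiplications,
   which are pipelined. Note that b[i] * (1/c) can differ from b[i] / c
   in the last bit, so this is not an exact replacement in general. */
void scale_by_reciprocal(double *a, const double *b, double c, size_t n) {
    const double inv_c = 1.0 / c;
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * inv_c;
}
```

Because of the possible rounding difference, compilers typically apply this rewrite only when value-changing floating-point optimizations are explicitly allowed.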
Performance is a function of
Clock frequency,
SIMD width
a := b + c
Assembly code
(note: Intel two-operand syntax)
1. Resident in register
a := b + c
d := a + e
a stays resident in a register, avoiding a store and a load
2. Subexpression elimination:
t1 = sin(alpha) * x + cos(alpha) * y;
t2 = -cos(alpha) * x + sin(alpha) * y;
becomes:
s = sin(alpha); c = cos(alpha);
t1 = s * x + c * y;
t2 = -c * x + s * y;
often done by compiler
load x from memory into cache, and from cache into register;
operate on it;
do the intervening instructions;
request x from memory, but since it is still in the cache, load it
from the cache into register; operate on it.
essential concept: data reuse
[Figure: cache miss fraction and cycles per op as a function of dataset size]
[Figure: cache line utilization and total kcycles as a function of stride]
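The kind of loop behind a stride measurement can be sketched as follows. This is an illustration, not the code that produced the figure; the function names and the 8-doubles-per-line figure (a 64-byte cache line) are assumptions.

```c
#include <stddef.h>

/* Sum every stride-th element of x[0..n-1]; stride must be at least 1.
   With stride 1 every element of each loaded cache line is used;
   with larger strides only part of each line is used. */
double strided_sum(const double *x, size_t n, size_t stride) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i += stride)
        sum += x[i];
    return sum;
}

/* Average fraction of each loaded cache line that is actually used,
   assuming elems_per_line elements per line (8 doubles for an assumed
   64-byte line) and stride >= 1. */
double cache_line_utilization(size_t stride, size_t elems_per_line) {
    if (stride >= elems_per_line)
        return 1.0 / (double)elems_per_line;  /* one element per touched line */
    return 1.0 / (double)stride;
}
```

Under these assumptions the utilization falls from 1 at stride 1 to 1/8 at stride 8, so the same number of useful elements requires up to eight times as many cache line loads.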
real*8 A(8192,3);
do i=1,512
a(i,3) = ( a(i,1)+a(i,2) )/2
end do
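The columns of this array are 8192 × 8 = 65536 bytes apart, so a(i,1), a(i,2) and a(i,3) can map to the same cache set. The C sketch below checks this for an assumed cache with 64-byte lines, 64 sets and a simple modulo mapping; these parameters and the function names are illustrative and do not describe a particular processor.

```c
#include <stddef.h>

/* Index of the cache set that a byte address maps to, for a cache with
   line_bytes bytes per line and num_sets sets (simple modulo mapping). */
size_t cache_set_index(size_t byte_addr, size_t line_bytes, size_t num_sets) {
    return (byte_addr / line_bytes) % num_sets;
}

/* Byte offset of element (i, col) in a column-major array of 8-byte
   reals with leading dimension ld, using 0-based indices and assuming
   the array starts at offset 0. */
size_t column_major_offset(size_t i, size_t col, size_t ld) {
    return (col * ld + i) * 8;
}
```

With ld = 8192 the three columns give the same set index for every i, so each iteration needs three lines in one set; a cache with associativity below 3 then evicts lines that are still in use. Padding the leading dimension, for instance to 8192 + 8, puts the three columns in different sets.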
Associativity        L1   L2
Intel (Woodcrest)     8    8
AMD (Bulldozer)       2    8
$\forall j\colon\; y_j = y_j + \sum_{i=1}^{m} x_{i,j}$
The number of L1 cache misses and the number of cycles for each j
column accumulation, vector length 4096 + 8
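The "4096 + 8" in the caption presumably refers to padding the vector length so that successive terms of the sum no longer sit a power of two apart in memory. A minimal sketch of the accumulation with an explicit allocated length ld follows; the storage layout x[i*ld + j] and the function name are assumptions for illustration.

```c
#include <stddef.h>

/* y[j] += sum over i of x(i,j), where x(i,j) is stored at x[i*ld + j].
   n is the vector length, m the number of vectors, ld >= n the allocated
   length of each vector. With ld = n = 4096 (a power of two) the m terms
   for one j lie 4096 doubles apart and may compete for the same cache
   set; with ld = n + 8 successive terms shift to different sets. */
void accumulate_columns(double *y, const double *x,
                        size_t m, size_t n, size_t ld) {
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            y[j] += x[i * ld + j];
}
```

The padding changes only where the data lives, not the result: calling the function with ld = n or ld = n + 8 gives the same y, provided x is filled consistently with the chosen ld.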
t = α + βn (α: latency, β: time per unit of data, i.e. inverse bandwidth, n: amount of data)
Feature size ∼s
Voltage ∼s
Current ∼s
Frequency ∼ $s^{-1}$
Miracle conclusion:
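The conclusion itself is not preserved here. The standard argument, sketched under the scaling relations listed above and the added assumption that the area per transistor scales as the square of the feature size, runs as follows:

```latex
\begin{align*}
\text{Power} &= V \cdot I \sim s \cdot s = s^2, \\
\text{Area} &\sim s^2, \\
\text{Power density} &= \frac{\text{Power}}{\text{Area}} \sim \frac{s^2}{s^2} = 1 .
\end{align*}
```

So under these assumptions, shrinking the feature size raises the frequency while the power per unit area stays constant.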
Charge $q = CV$
Work $W = qV = CV^2$
Power $W/\text{time} = WF = CV^2 F$
Two cores at half frequency:
$C_{\text{multi}} = 2C$
$F_{\text{multi}} = F/2$
$V_{\text{multi}} = V/2$
⇒ $P_{\text{multi}} = P/4$
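The factor of four follows by substituting the three multicore quantities into the power formula P = CV²F. Written out as a worked step:

```latex
\begin{align*}
P_{\text{multi}}
  &= C_{\text{multi}} \, V_{\text{multi}}^2 \, F_{\text{multi}} \\
  &= (2C) \left(\frac{V}{2}\right)^{2} \frac{F}{2}
   = \frac{2}{4 \cdot 2} \, C V^2 F
   = \frac{P}{4} .
\end{align*}
```

Assuming the work parallelizes perfectly over the two cores, the aggregate operation rate equals that of one core at frequency F, at a quarter of the power.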
Performance limited by
Processor peak performance: absolute limit
Bandwidth: linear correlation with performance
Arithmetic intensity: ratio of operations per transfer
If AI high enough: processor-limited
otherwise: bandwidth-limited
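The two regimes can be written as the minimum of two bounds. A minimal C sketch, with hypothetical function names and any consistent units (for example operations per second and transfers per second):

```c
/* Attainable performance in a simple roofline model.
   peak:       processor peak performance (operations per second)
   bandwidth:  memory bandwidth (transfers per second)
   intensity:  arithmetic intensity (operations per transfer) */
double roofline(double peak, double bandwidth, double intensity) {
    double bandwidth_bound = bandwidth * intensity;
    return bandwidth_bound < peak ? bandwidth_bound : peak;
}

/* Returns 1 if the kernel is processor-limited, 0 if bandwidth-limited. */
int is_processor_limited(double peak, double bandwidth, double intensity) {
    return bandwidth * intensity >= peak;
}
```

With an assumed peak of 100 and bandwidth of 10, a kernel with intensity 2 is bandwidth-limited at 20, while one with intensity 50 reaches the peak of 100.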
Matrix-matrix product C = A · B
Inner products
for ( i )
for ( j )
for ( k )
c[i,j] += a[i,k] * b[k,j]
for ( k )
for ( i )
for ( j )
c[i,j] += a[i,k] * b[k,j]
for ( i )
for ( k )
for ( j )
c[i,j] += a[i,k] * b[k,j]
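The three loop orders above can be written as runnable C as follows. This is a sketch assuming square n × n matrices in row-major storage, with element (i,j) at index i*n + j; the function names are made up for illustration.

```c
#include <stddef.h>

/* All three variants compute C += A*B and differ only in loop order. */

/* i,j,k: each c(i,j) is an inner product of row i of A and column j of B. */
void matmul_ijk(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* k,i,j: for each k, add the outer product of column k of A and row k of B. */
void matmul_kij(double *c, const double *a, const double *b, size_t n) {
    for (size_t k = 0; k < n; k++)
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* i,k,j: row i of C is updated with a(i,k) times row k of B. */
void matmul_ikj(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

All three give the same result up to floating-point rounding; what changes is the memory access pattern. With row-major storage, the innermost j loop in the last two variants walks through rows of B and C with stride 1, while the innermost k loop of the first variant walks down a column of B.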
$C_{**} = \sum_k A_{*k} B_{k*}$
For inner i:
// compute C[i,*] :
for k:
C[i,*] += A[i,k] * B[k,*]
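$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$
with $C_{11} = A_{11} B_{11} + A_{12} B_{21}$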
The recursive approach is automatically cache-contained: the blocks eventually become small enough to fit in cache.
Not as high-performance as a cache-aware approach...
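A minimal sketch of the recursive scheme, assuming n is a power of two and row-major storage with leading dimension ld; the function name and the tiny base case are chosen for illustration only.

```c
#include <stddef.h>

/* C += A*B for n x n blocks inside row-major matrices with leading
   dimension ld. n is assumed to be a power of two. */
void matmul_recursive(double *c, const double *a, const double *b,
                      size_t n, size_t ld) {
    if (n <= 2) {
        /* Small base case; a tuned code would stop at a larger block. */
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    c[i * ld + j] += a[i * ld + k] * b[k * ld + j];
        return;
    }
    size_t h = n / 2;
    /* Pointers to the four h x h sub-blocks of each matrix. */
    const double *a11 = a, *a12 = a + h;
    const double *a21 = a + h * ld, *a22 = a + h * ld + h;
    const double *b11 = b, *b12 = b + h;
    const double *b21 = b + h * ld, *b22 = b + h * ld + h;
    double *c11 = c, *c12 = c + h;
    double *c21 = c + h * ld, *c22 = c + h * ld + h;

    /* C11 += A11*B11 + A12*B21, and likewise for the other blocks. */
    matmul_recursive(c11, a11, b11, h, ld);
    matmul_recursive(c11, a12, b21, h, ld);
    matmul_recursive(c12, a11, b12, h, ld);
    matmul_recursive(c12, a12, b22, h, ld);
    matmul_recursive(c21, a21, b11, h, ld);
    matmul_recursive(c21, a22, b21, h, ld);
    matmul_recursive(c22, a21, b12, h, ld);
    matmul_recursive(c22, a22, b22, h, ld);
}
```

The code never mentions a cache size: at some recursion depth the three sub-blocks fit in cache regardless of what that size is. A cache-aware code instead picks the block size explicitly for a given machine.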