
Processor Architecture

Victor Eijkhout

Fall 2023

1 Eijkhout – Processor Architecture – Fall 2023


Justification

The performance of a parallel code has as one component the
behaviour of the single processor or single-threaded code. In this
section we discuss the basics of how a processor executes
instructions, and how it handles the data these instructions operate on.

2 Eijkhout – Processor Architecture – Fall 2023


Structure of a modern processor

3 Eijkhout – Processor Architecture – Fall 2023


Von Neumann machine

The ideal processor:


(Stored program)
An instruction contains the operation and two operand locations
Processor decodes instruction, gets operands, computes and
writes back the result
Repeat

4 Eijkhout – Processor Architecture – Fall 2023


The actual state of affairs

Single instruction stream versus multiple cores / floating point units
Single instruction stream versus Instruction Level Parallelism
Unit-time-addressable memory versus large latencies

Modern processors contain lots of magic to make them seem like Von
Neumann machines.

5 Eijkhout – Processor Architecture – Fall 2023


Complexity measures

Traditional: processor speed was paramount. Operation counting.


Nowadays: memory is slower than processors
This course
Study data movement aspects
Algorithm design for processor reality

6 Eijkhout – Processor Architecture – Fall 2023


A first look at a processor

7 Eijkhout – Processor Architecture – Fall 2023


Structure of a core

8 Eijkhout – Processor Architecture – Fall 2023


Motivation for pipelining

An operation consists of several stages.


Addition:
Decoding the instruction operands.
Data fetch into register
Aligning the exponents:

.35 × 10−1 + .6 × 10−2 becomes
.35 × 10−1 + .06 × 10−1

Adding mantissas, giving .41.


Normalizing the result, giving .41 × 10−1 .
Storing the result.
pipeline stages

9 Eijkhout – Processor Architecture – Fall 2023


Pipelining, pictorially

Discrete hardware for each stage:

10 Eijkhout – Processor Architecture – Fall 2023


Analysis

Operation timing:

n operations

ℓ number of stages ⇒ t (n) = nℓτ

τ clock cycle

With pipelining:
t (n) = [s + ℓ + n − 1]τ

where s is a setup cost


⇒ Asymptotic speedup is ℓ
n1/2 : value of n for which the speedup is ℓ/2
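
A short derivation (not on the slide, but it follows directly from the two timings above):

speedup(n) = nℓτ / [s + ℓ + n − 1]τ  →  ℓ   for n → ∞

speedup(n) = ℓ/2  ⇔  2n = s + ℓ + n − 1  ⇔  n1/2 = s + ℓ − 1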

11 Eijkhout – Processor Architecture – Fall 2023


Applicability of pipelining

Pipelining works for:


vector addition/multiplication
Division/square root maybe pipelined, but much slower

12 Eijkhout – Processor Architecture – Fall 2023


Recurrences

Pipelining does not immediately work:

for (i) {
x[i+1] = a[i]*x[i] + b[i];
}

Transform:

xn+2 = an+1 xn+1 + bn+1
     = an+1 (an xn + bn) + bn+1
     = an+1 an xn + an+1 bn + bn+1
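
A sketch of how the transformed recurrence maps back to code (the names n, a, b, x are taken from the loop above; this is an illustration, not the slide's code): each iteration now computes two new values, and the combined coefficients are independent of x[i], which gives the pipeline more independent work.

/* Two-step recurrence, a hedged sketch: x[i+2] is computed directly
   from x[i] using the combined coefficients derived above.          */
for (int i = 0; i + 2 <= n; i += 2) {
  double aa = a[i+1]*a[i];              /* independent of x[i] */
  double ab = a[i+1]*b[i] + b[i+1];     /* independent of x[i] */
  x[i+1] = a[i]*x[i] + b[i];
  x[i+2] = aa*x[i] + ab;                /* = a[i+1]*x[i+1] + b[i+1] */
}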

13 Eijkhout – Processor Architecture – Fall 2023


Instruction pipeline

Instruction-Level Parallelism: more general notion of independent
instructions
Requires independent instructions
As frequency goes up, pipeline gets longer: more demands on
compiler

14 Eijkhout – Processor Architecture – Fall 2023


Instruction-Level Parallelism

multiple-issue of independent instructions


branch prediction and speculative execution
out-of-order execution
prefetching
Problems: complicated circuitry, hard to maintain performance

15 Eijkhout – Processor Architecture – Fall 2023


Implications

Long pipeline needs many independent instructions:
demands on compiler
Conditionals break the stream of independent instructions
Processor tries to predict branches
branch misprediction penalty:
pipeline needs to be flushed and refilled
avoid conditionals in inner loops!

16 Eijkhout – Processor Architecture – Fall 2023


Instructions

Addition/multiplication: pipelined
Division (and square root): much slower
for ( i )
a[i] = b[i] / c

Can you improve on this?


Fused Multiply-Add (FMA) s += a*b
where can you use this?
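
One common answer to the division question (a sketch, not from the slides; multiplying by the reciprocal may change the result in the last bit): hoist the slow division out of the loop so that only pipelined multiplications remain. The FMA pattern then shows up naturally in, for instance, inner products: s += a[i]*b[i].

/* Sketch: one division instead of n; the loop body is now pipelinable. */
double rc = 1.0 / c;
for (int i = 0; i < n; i++)
  a[i] = b[i] * rc;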

17 Eijkhout – Processor Architecture – Fall 2023


Peak performance

Performance is a function of
Clock frequency,
SIMD width
Load/store unit behavior

Floating point capabilities of several processor architectures
(DAXPY cycle number for 8 operands):

Processor             year   add/mult/fma units   daxpy cycles
                             (count×width)        (arith vs load/store)
MIPS R10000           1996   1×1 + 1×1 + 0        8/24
Alpha EV5             1996   1×1 + 1×1 + 0        8/12
IBM Power5            2004   0 + 0 + 2×1          4/12
AMD Bulldozer         2011   2×2 + 2×2 + 0        2/4
Intel Sandy Bridge    2012   1×4 + 1×4 + 0        2/4
Intel Haswell         2014   0 + 0 + 2×4          1/2

18 Eijkhout – Processor Architecture – Fall 2023


Memory hierarchy: caches, register, TLB.

19 Eijkhout – Processor Architecture – Fall 2023


The Big Story

DRAM memory is slow, so let’s put small SRAM close to the processor
This helps if data is reused
Does the algorithm have reuse?
Does the implementation reuse data?

20 Eijkhout – Processor Architecture – Fall 2023


Bandwidth and latency

Important theoretical concept:


latency is delay between request for data and availability
bandwidth is rate at which data arrives thereafter

21 Eijkhout – Processor Architecture – Fall 2023


Memory hierarchy

22 Eijkhout – Processor Architecture – Fall 2023


Registers

23 Eijkhout – Processor Architecture – Fall 2023


Computing out of registers

a := b + c

load the value of b from memory into a register,


load the value of c from memory into another register,
compute the sum and write that into yet another register, and
write the sum value back to the memory location of a.

24 Eijkhout – Processor Architecture – Fall 2023


Register usage

Assembly code
(note: Intel two-operand syntax)

addl %eax, %edx

Registers are named


Can be explicitly addressed by the programmer
. . . as opposed to caches.
Assembly coding or inline assembly (compiler dependent)
. . . but typically generated by compiler

25 Eijkhout – Processor Architecture – Fall 2023


Examples of register usage

1. Resident in register

a := b + c
d := a + e
a stays resident in register, avoid store and load
2. subexpression elimination:

t1 = sin(alpha) * x + cos(alpha) * y;
t2 = -cos(alpha) * x + sin(alpha) * y;
becomes:

s = sin(alpha); c = cos(alpha);
t1 = s * x + c * y;
t2 = -c * x + s * y;
often done by compiler

26 Eijkhout – Processor Architecture – Fall 2023


Caches

27 Eijkhout – Processor Architecture – Fall 2023


Cache basics

Fast SRAM in between memory and registers: mostly serves data reuse

... = ... x ..... // instruction using x


......... // several instructions not involving x
... = ... x ..... // instruction using x

load x from memory into cache, and from cache into register;
operate on it;
do the intervening instructions;
request x from memory, but since it is still in the cache, load it
from the cache into register; operate on it.
essential concept: data reuse

28 Eijkhout – Processor Architecture – Fall 2023


Cache levels

Levels 1,2,3(,4): L1, L2, etc.


Increasing size, increasing latency, decreasing bandwidth
(Note: L3/L4 can be fairly big; beware benchmarking)
Cache hit / cache miss: one level is consulted, then the next
L1 has separate data / instruction cache, other levels mixed
Caches do not have enough bandwidth to serve the processor:
coding for reuse on all levels.

29 Eijkhout – Processor Architecture – Fall 2023


Cache misses

Compulsory miss: first time data is referenced


Capacity miss: data was in cache, but has been flushed
(overwritten) by LRU policy
Conflict miss: two items get mapped to the same cache location,
even if there are no capacity problems
Invalidation miss: data becomes invalid because of activity of
another core

30 Eijkhout – Processor Architecture – Fall 2023


Cache hits

Data has been requested, used a second time: temporal locality


⇒ Can’t wait too long between uses
(Data can be loaded because it’s close to data requested: spatial
locality. Later.)

31 Eijkhout – Processor Architecture – Fall 2023


Capacity miss

(Why is that last block going where it is going?)

32 Eijkhout – Processor Architecture – Fall 2023


Cache capacity

Loading data multiple times


LRU: oldest item evicted if needed
Reuse if not too much data

for ( lots of times ) // sequential loop


load and process data // probably parallel loop

33 Eijkhout – Processor Architecture – Fall 2023


Illustration of capacity

[Figure: cache miss fraction and cycles per op as a function of dataset size]

34 Eijkhout – Processor Architecture – Fall 2023


Replacement policies

What determines where new data goes /
what old data is overwritten?
Least Recently Used (LRU): most common
First-In / First-Out (FIFO): IBM Power4. Not a good idea.
Random Replacement. Sometimes used.
It’s actually more subtle than pure LRU . . .

35 Eijkhout – Processor Architecture – Fall 2023


Cache lines

Memory requests go by byte or word


Memory transfers go by cache line:
typically 64 bytes / 8 double precision numbers
Cache line transfer costs bandwidth
⇒ important to use all elements

36 Eijkhout – Processor Architecture – Fall 2023


Effects of striding

Always 8 numbers transferred


With stride s > 1: 8/s elements used
Loss of efficiency if bandwidth-limited

37 Eijkhout – Processor Architecture – Fall 2023


Cache line use

for (i=0; i<N; i++)


... = ... x[i] ...

for (i=0; i<N; i+=stride)


... = ... x[i] ...

38 Eijkhout – Processor Architecture – Fall 2023


Stride effects

for (i=0,n=0; i<L1WORDS; i++,n+=stride)


array[n] = 2.3*array[n]+1.2;

[Figure: cache line utilization and total kcycles as a function of stride]

39 Eijkhout – Processor Architecture – Fall 2023


Cache mapping

Cache is smaller than memory, so we need a mapping scheme
memory address ↦ cache address
Ideal: any address can go anywhere; LRU policy for replacement
pro: optimal; con: slow, expensive to manufacture
Simple: direct mapping by truncating addresses
pro: fast and cheap; con: I’ll show you in a minute
Practical: limited associativity
‘enough but not too much’

40 Eijkhout – Processor Architecture – Fall 2023


Direct mapping

Direct mapping of 32-bit addresses into a 64K cache


Use last number of bits to find cache address
If you traverse an array, a contiguous chunk will be mapped to
cache without conflict.
If (memory) addresses are cache size apart, they get mapped to
the same cache location

41 Eijkhout – Processor Architecture – Fall 2023


Conflicts

Mapping conflicts in a direct-mapped cache.

42 Eijkhout – Processor Architecture – Fall 2023


The problem with direct mapping

real*8 a(8192,3)
do i=1,512
  a(i,3) = ( a(i,1)+a(i,2) )/2
end do

In each iteration 3 elements map to the same cache location:
constant overwriting (‘eviction’, cache thrashing):
low performance

43 Eijkhout – Processor Architecture – Fall 2023


Associative cache mapping

Allow each memory address to go to multiple (but not all) cache
addresses; typically 2, 4, 8
Prevents problems with multiple arrays
Reasonably fast
Often lower associativity for L1 than L2, L3

Associativity L1 L2
Intel (Woodcrest) 8 8
AMD (Bulldozer) 2 8

44 Eijkhout – Processor Architecture – Fall 2023


Associativity

Associative cache structure

45 Eijkhout – Processor Architecture – Fall 2023


Illustration of associativity

Two caches of 12 elements: direct mapped (left) and 3-way
associative (right)
Direct map: items 0 and 12 conflict
Associative: no conflict

46 Eijkhout – Processor Architecture – Fall 2023


Associativity in practice

∀j : yj = yj + ∑i=1..m xi,j
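
In code this column accumulation looks roughly as follows (a sketch; the names n, m, x, y are assumptions). When the column length n is a power of two, the m elements accessed for a fixed j are a power-of-two number of bytes apart, so they can all map to the same associativity set and evict each other.

/* Column accumulation: add m columns of length n into y.
   Column i is stored contiguously starting at x + i*n, so for a fixed j
   the accessed elements x[i*n + j] are exactly n doubles apart.        */
for (int j = 0; j < n; j++)
  for (int i = 0; i < m; i++)
    y[j] += x[i*n + j];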

47 Eijkhout – Processor Architecture – Fall 2023


The number of L1 cache misses and the number of cycles for each j
One remedy

Do not use powers of 2.

The number of L1 cache misses and the number of cycles for each j
column accumulation, vector length 4096 + 8

48 Eijkhout – Processor Architecture – Fall 2023


Exercise

Write a small cache simulator in your favorite language. Assume a
k-way associative cache of 32 entries and an architecture with 16-bit
addresses.
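
One possible starting point (a sketch only, in C, with several arbitrary choices: the set index comes from the low address bits, and per-line ages implement LRU):

#include <stdint.h>

#define ENTRIES 32               /* total cache entries           */
#define K       4                /* associativity: ENTRIES/K sets */
#define SETS    (ENTRIES/K)

typedef struct { uint16_t tag; int valid, age; } line_t;
static line_t cache[SETS][K];

/* Simulate one access to a 16-bit address; return 1 on hit, 0 on miss. */
int cache_access(uint16_t addr) {
  int set = addr % SETS;
  uint16_t tag = addr / SETS;
  int victim = 0;
  for (int w = 0; w < K; w++) cache[set][w].age++;   /* everyone ages        */
  for (int w = 0; w < K; w++) {
    if (cache[set][w].valid && cache[set][w].tag == tag) {
      cache[set][w].age = 0;                         /* hit: mark as MRU     */
      return 1;
    }
    if (cache[set][w].age > cache[set][victim].age)  /* remember oldest line */
      victim = w;
  }
  cache[set][victim] = (line_t){ .tag = tag, .valid = 1, .age = 0 };
  return 0;                                          /* miss: evict oldest   */
}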

49 Eijkhout – Processor Architecture – Fall 2023


Exercise: vectorsum

Compare sequential performance to single-threaded OMP


For some problem sizes observe a difference in performance
Use Intel option -qopt-report=3 and inspect the report.
Compare different compilers: Intel 19 behaves differently from 24!
Also gcc13.

for ( int iloop=0; iloop<nloops; ++iloop ) {
  for ( int i=0; i<vectorsize; ++i ) {
    outvec[i] += invec[i]*loopcoeff[iloop];
  }
}

Analyze and report

50 Eijkhout – Processor Architecture – Fall 2023


More memory system topics

51 Eijkhout – Processor Architecture – Fall 2023


Bandwidth / latency

Simple model for sending n words:

t = α + βn    (α latency, β inverse bandwidth)

Quoted bandwidth figures are always optimistic:


bandwidth shared between cores
not enough bandwidth for all cores:
⇒ speedup less than linear
bandwidth wasted on coherence
NUMA: pulling data from other socket
assumes optimal scheduling of DRAM banks

52 Eijkhout – Processor Architecture – Fall 2023


Prefetch

Do you have to wait for every item from memory?


Memory controller can infer streams: prefetch
Sometimes controllable through assembly, directives, libraries
(AltiVec)
One form of latency hiding

53 Eijkhout – Processor Architecture – Fall 2023


Memory pages

Memory is organized in pages:


Translation between the logical address, as used by the program, and
the physical address in memory
This serves virtual memory and relocatable code,
so we need another translation stage.

54 Eijkhout – Processor Architecture – Fall 2023


Page translation: TLB

General page translation: slowish and expensive


Translation Look-aside Buffer (TLB) is a small list of frequently
used pages
Example of spatial locality: items on an already referenced page
are found faster

55 Eijkhout – Processor Architecture – Fall 2023


TLB misses

#define INDEX(i,j,m,n) i+j*m

array = (double*) malloc(m*n*sizeof(double));
/* traversal #2 */
for (i=0; i<m; i++)
  for (j=0; j<n; j++)
    array[INDEX(i,j,m,n)] = array[INDEX(i,j,m,n)]+1;

56 Eijkhout – Processor Architecture – Fall 2023


TLB hits

#define INDEX(i,j,m,n) i+j*m

array = (double*) malloc(m*n*sizeof(double));
/* traversal #1 */
for (j=0; j<n; j++)
  for (i=0; i<m; i++)
    array[INDEX(i,j,m,n)] = array[INDEX(i,j,m,n)]+1;

57 Eijkhout – Processor Architecture – Fall 2023


Little’s Law

Item loaded from memory, processed, new item loaded in response
But this can only happen after latency wait
Items during latency are independent, therefore

Concurrency = Bandwidth × Latency.
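
A worked instance (using the DRAM numbers that appear on the balance-analysis slide below; the 64-byte cache line matches the earlier cache-line slide):

Concurrency = 51 GB/s × 80 ns ≈ 4080 bytes ≈ 64 cache lines of 64 bytes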

58 Eijkhout – Processor Architecture – Fall 2023


Multicore issues

59 Eijkhout – Processor Architecture – Fall 2023


Why multicore

Quest for higher performance:


Not enough instruction parallelism for long pipelines
Two cores at half speed more energy-efficient than one at full
speed.
Multicore solution:
More theoretical performance
Burden for parallelism is now on the programmer

60 Eijkhout – Processor Architecture – Fall 2023


Dennard scaling

Scale down feature size by s:

Feature size ∼ s
Voltage ∼ s
Current ∼ s
Frequency ∼ s⁻¹

Miracle conclusion:

Power = V · I ∼ s² ; Power density ∼ 1

Everything gets better, cooling problem stays the same


Opportunity for more components, higher frequency

61 Eijkhout – Processor Architecture – Fall 2023


Dynamic power

Charge  q = CV
Work    W = qV = CV²
Power   W/time = WF = CV²F

Two cores at half frequency:

Cmulti = 2C, Vmulti = V/2, Fmulti = F/2  ⇒  Pmulti = P/4.

Same computation, less power

62 Eijkhout – Processor Architecture – Fall 2023


Multicore caches

63 Eijkhout – Processor Architecture – Fall 2023


The coherence problem

64 Eijkhout – Processor Architecture – Fall 2023


Cache coherence

Modified-Shared-Invalid (MSI) coherence protocol:


Modified: the cacheline has been modified
Shared: the line is present in at least one cache and is
unmodified.
Invalid: the line is not present, or it is present but a copy in
another cache has been modified.

65 Eijkhout – Processor Architecture – Fall 2023


Coherence issues

Coherence is automatic, so you don’t have to worry about it. . .


. . . except when it saps performance
Beware false sharing
writes to different elements of a cache line
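
A hedged sketch of the false-sharing pattern (the names are illustrative; OpenMP is used only because it already appears in the vectorsum exercise earlier): every thread writes its own array element, but all elements live on one cache line, so each write invalidates the line in the other cores' caches.

#include <omp.h>

/* 8 doubles = one 64-byte cache line: per-thread partial sums that
   falsely share a line. Padding each slot to 64 bytes avoids the
   coherence traffic.                                               */
double partial[8];

void partial_sums(const double *data, long n) {
  #pragma omp parallel
  {
    int t = omp_get_thread_num();       /* assume at most 8 threads      */
    #pragma omp for
    for (long i = 0; i < n; i++)
      partial[t] += data[i];            /* all threads write the same line */
  }
}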

66 Eijkhout – Processor Architecture – Fall 2023


Balance analysis

Sandy Bridge core can absorb 300 GB/s
4 DDR3/1600 channels provide 51 GB/s; the difference has to come
from reuse
It gets worse: latency 80 ns, bandwidth 51 GB/s,
Little’s law: parallelism 64 cache lines
However, each core only has 10 line fill buffers,
so we need 6–7 cores to provide the data for one core
Power: cores 72%, uncore 17%, DRAM 11%.
Core power goes 40% to instruction handling, not arithmetic
Time for a redesign of processors and programming; see my
research presentation

67 Eijkhout – Processor Architecture – Fall 2023


Programming strategies for performance

68 Eijkhout – Processor Architecture – Fall 2023


How much performance is possible?

Performance limited by
Processor peak performance: absolute limit
Bandwidth: linear correlation with performance
Arithmetic intensity: ratio of operations per transfer
If AI high enough: processor-limited
otherwise: bandwidth-limited
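
A worked example of arithmetic intensity (for the DAXPY operation from the peak-performance table earlier; byte counts assume double precision and no cache reuse):

y[i] = y[i] + a*x[i]  :  2 flops per iteration
traffic: 8 B (load x[i]) + 8 B (load y[i]) + 8 B (store y[i]) = 24 B
AI = 2/24 ≈ 0.08 flop/byte  ⇒  firmly bandwidth-limited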

69 Eijkhout – Processor Architecture – Fall 2023


Performance depends on algorithm:

70 Eijkhout – Processor Architecture – Fall 2023


Insufficient utilization of functional units:

71 Eijkhout – Processor Architecture – Fall 2023


Imperfect data transfer:

72 Eijkhout – Processor Architecture – Fall 2023


Spatial and temporal locality

Temporal locality: use an item, use it again but from cache
efficient because the second transfer is cheaper.
Spatial locality: use an item, then use one ‘close to it’
(for instance from the same cacheline)
efficient because the item is already reachable even though it was not
used before.
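
A minimal code sketch of both kinds of locality (the names w, x, y, n are assumptions): the small array w is re-read in every outer iteration, which is temporal locality; x and y are traversed consecutively within cache lines, which is spatial locality.

/* Temporal locality: w (a few cache lines) is reused for every i and
   stays resident in cache. Spatial locality: x and y are traversed
   with stride 1, so every element of a loaded cache line is used.    */
for (long i = 0; i < n; i++)
  for (int k = 0; k < 8; k++)
    y[i] += w[k] * x[i];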

73 Eijkhout – Processor Architecture – Fall 2023


Architecture aware programming

Cache size: block loops


pipelining and vector instructions: expose streams of instructions
reuse: restructure code (loop merging and splitting, unrolling)
TLB: don’t jump all over memory
associativity: watch out for powers of 2

74 Eijkhout – Processor Architecture – Fall 2023


Loop blocking

Multiple passes over data

for ( k < small bound )
  for ( i < N )
    x[i] = f( x[i], k, .... )

Block to be cache contained:

for ( ii < N; ii += blocksize )
  for ( k < small bound )
    for ( i=ii; i<ii+blocksize; i++ )
      x[i] = f( x[i], k, .... )

This requires independence of operations

75 Eijkhout – Processor Architecture – Fall 2023


The ultimate in performance programming: DGEMM

Matrix-matrix product C = A · B

∀i ∀j ∀k : cij += aik bkj

Three independent loops i, j, k
all three blocked: i′, j′, k′
Many loop permutations and blocking factors to choose from

76 Eijkhout – Processor Architecture – Fall 2023


DGEMM variant

Inner products

for ( i )
  for ( j )
    for ( k )
      c[i,j] += a[i,k] * b[k,j]

77 Eijkhout – Processor Architecture – Fall 2023


DGEMM variant

Outer product: updates with low-rank columns-times-vector

for ( k )
  for ( i )
    for ( j )
      c[i,j] += a[i,k] * b[k,j]

78 Eijkhout – Processor Architecture – Fall 2023


DGEMM variant

Building up rows by linear combinations

for ( i )
  for ( k )
    for ( j )
      c[i,j] += a[i,k] * b[k,j]

Exchanging i , j: building up columns

79 Eijkhout – Processor Architecture – Fall 2023


Rank 1 updates

C∗∗ = ∑k A∗k Bk∗

80 Eijkhout – Processor Architecture – Fall 2023


Matrix-panel multiply

Block of A times ‘sliver’ of B

81 Eijkhout – Processor Architecture – Fall 2023


Inner algorithm

For inner i:

// compute C[i,*] :
for k:
C[i,*] = A[i,k] * B[k,*]

82 Eijkhout – Processor Architecture – Fall 2023


Tuning

For inner i:

// compute C[i,*] :
for k:
C[i,*] += A[i,k] * B[k,*]

C[i,*] stays in registers
A[i,k] and B[k,*] stream from L1
block size of A chosen for the L2 size
A stored by rows to prevent TLB problems

83 Eijkhout – Processor Architecture – Fall 2023


Cache-oblivious programming

Observation: recursive subdivision will ultimately make a problem
small / well-behaved enough

84 Eijkhout – Processor Architecture – Fall 2023


Cache-oblivious matrix-matrix multiply

    
( C11 C12 )   ( A11 A12 ) ( B11 B12 )
( C21 C22 ) = ( A21 A22 ) ( B21 B22 )

with C11 = A11 B11 + A12 B21
Recursive approach will be cache contained.
Not as high performance as being cache-aware. . .
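
A compact sketch of the recursion (an illustration, not the reference implementation: it assumes square matrices whose size n is a power of two, row-major storage with leading dimension ld, and an arbitrary cutoff of 32):

/* Cache-oblivious matmul sketch: C += A*B by recursive quartering.
   Once a subproblem fits in cache, the naive kernel runs cache-contained. */
void rmm(int n, int ld, const double *A, const double *B, double *C) {
  if (n <= 32) {                       /* base case: small enough for cache */
    for (int i = 0; i < n; i++)
      for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
          C[i*ld+j] += A[i*ld+k] * B[k*ld+j];
  } else {
    int h = n/2;                       /* split each matrix into 2x2 blocks */
    const double *A11=A, *A12=A+h, *A21=A+h*ld, *A22=A+h*ld+h;
    const double *B11=B, *B12=B+h, *B21=B+h*ld, *B22=B+h*ld+h;
    double       *C11=C, *C12=C+h, *C21=C+h*ld, *C22=C+h*ld+h;
    rmm(h,ld,A11,B11,C11); rmm(h,ld,A12,B21,C11);  /* C11 = A11 B11 + A12 B21 */
    rmm(h,ld,A11,B12,C12); rmm(h,ld,A12,B22,C12);
    rmm(h,ld,A21,B11,C21); rmm(h,ld,A22,B21,C21);
    rmm(h,ld,A21,B12,C22); rmm(h,ld,A22,B22,C22);
  }
}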

85 Eijkhout – Processor Architecture – Fall 2023


The power question

86 Eijkhout – Processor Architecture – Fall 2023
