Memory in Multiprocessor Systems
Z. Jerry Shi
Assistant Professor of Computer Science and Engineering
University of Connecticut
Parallel Computers
Parallel Processors “Religion”
Why Multiprocessors?
Parallel Architecture
Performance Metrics: Latency and Bandwidth
• Bandwidth
– Need high bandwidth in communication
– Match limits in network, memory, and processor
– Challenge is the link speed of the network interface vs. the bisection bandwidth of the network
• Latency
– Affects performance, since the processor may have to wait
– Affects ease of programming, since it takes more thought to overlap communication and computation
– Communication overhead is a problem in many machines
• Latency Hiding
– How can a mechanism help hide latency?
– Increases the burden on the programming system
– Examples: overlap message send with computation, prefetch data, switch to other tasks
MIMD: Centralized Shared Memory
• Advantages: more memory bandwidth at low cost; lower latency to local memory
• Drawbacks: longer communication latency; more complex software model
Distributed Memory Versions
• Programming Model:
– Multiprogramming: lots of jobs, no communication
– Shared memory: communicate via memory
– Message passing: send and receive messages
– Data parallel: several agents operate on several data sets simultaneously, then exchange information globally and simultaneously (shared memory or message passing)
• Communication Abstraction (a small contrast example follows this slide):
– Shared address space, e.g., load, store, atomic swap
– Message passing, e.g., send and receive library calls
– Debate over this topic (ease of programming, scaling) => many hardware designs tied 1:1 to a programming model
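To make the two abstractions concrete, here is a small sketch (my addition, not from the slides) of passing one integer from a producer to a consumer both ways. The MPI program is an illustrative assumption: it requires an MPI installation and two ranks.

#include <mpi.h>
#include <stdio.h>

/* Shared address space: communicate with ordinary loads and stores.
 * (A real program would also need synchronization on the flag/value pair;
 * see the memory-consistency slides later.) */
int shared_value;
void producer_shared(void) { shared_value = 42; }
int  consumer_shared(void) { return shared_value; }

/* Message passing: communicate with explicit send/receive library calls.
 * Run with two ranks, e.g., mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);              /* to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}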
Data Parallel Model
Communication Mechanisms
Shared Memory Model
Advantages of Shared-Memory Communication
• Compatibility
– Oldest and most popular model
• Ease of programming when communication patterns among processors are complex and dynamic
– Also simplifies the compiler
• Use the familiar shared-memory model to develop applications
– Attention needed only on performance-critical accesses
• Lower overhead for communication and better use of bandwidth when communicating small items
• Can use hardware-controlled caching to reduce the frequency of remote communication

Advantages of Message-Passing Communication
• Hardware can be simpler
• Communication is explicit
– It is easier to understand when communication happens and what it costs
– In shared memory it can be hard to know when you are communicating and how costly it is
• Forces the programmer to focus on communication, the costly aspect of parallel computing
• Synchronization is naturally associated with sending messages
– Reduces the possibility of errors introduced by incorrect synchronization
• Easier to use sender-initiated communication
– May have some performance advantages
Communication Options
• Amdahl’s Law:

  Speedup = 1 / ( FracX / SpeedupX + (1 − FracX) )

• The sequential portion limits parallel speedup
– Speedup <= 1 / (1 − FracX)
• The large latency of remote access is another major challenge

Example: What fraction can be sequential if we want an 80x speedup from 100 processors? Assume either 1 processor or all 100 are fully used.

  80 = 1 / ( FracX/100 + (1 − FracX) )
  0.8 · FracX + 80 · (1 − FracX) = 80 − 79.2 · FracX = 1
  FracX = (80 − 1) / 79.2 = 0.9975

Only 0.25% of the work can be sequential! (A small numeric check follows.)
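As a quick check of the arithmetic above, a small C helper (my addition, not from the slides) evaluates Amdahl's Law directly:

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / (fracX/speedupX + (1 - fracX)) */
double amdahl(double fracX, double speedupX) {
    return 1.0 / (fracX / speedupX + (1.0 - fracX));
}

int main(void) {
    /* The slide's numbers: 99.75% of the work runs on 100 processors. */
    printf("FracX = 0.9975, 100 CPUs: %.1fx\n", amdahl(0.9975, 100.0));
    /* Even 1% sequential caps the speedup near 50x. */
    printf("FracX = 0.99,   100 CPUs: %.1fx\n", amdahl(0.99, 100.0));
    return 0;
}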
But In The Real World...
Multiprocessor Coherence Rules
Bus Snooping Topology
• Memory: centralized with uniform access time (“UMA”) and bus interconnect
• Symmetric Multiprocessor (SMP)
Basic Snooping Protocols
Snoopy-Cache State Machine-I
• State machine for CPU requests, kept for each cache block; the states are Invalid, Shared (read only), and Exclusive (read/write). (A C sketch of the CPU-side transitions follows this slide.)
• Transitions on CPU requests:
– Invalid, CPU read: place read miss on bus; go to Shared
– Invalid, CPU write: place write miss on bus; go to Exclusive
– Shared, CPU read hit: no bus action
– Shared, CPU read miss: place read miss on bus
– Shared, CPU write: place write miss on bus; go to Exclusive
– Exclusive, CPU read hit or write hit: no bus action
– Exclusive, CPU read miss: write back the block, place read miss on bus; go to Shared
– Exclusive, CPU write miss: write back the block, place write miss on bus; stay in Exclusive
• Transitions on bus requests (snooping):
– Shared, write miss for this block on the bus: invalidate the block
– Exclusive, read miss for this block: write back the block (abort the memory access); go to Shared
– Exclusive, write miss for this block: write back the block (abort the memory access); go to Invalid
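To make the transitions above concrete, here is a minimal C sketch (my addition, not from the slides) of the CPU-request side only; bus actions are represented by prints, and the bus-request (snoop) side is omitted.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
typedef enum { CPU_READ, CPU_WRITE } CpuOp;

/* Returns the next state of the block for a CPU request. */
BlockState cpu_request(BlockState state, CpuOp op, int hit) {
    switch (state) {
    case INVALID:                                  /* miss by definition */
        puts(op == CPU_READ ? "place read miss on bus" : "place write miss on bus");
        return op == CPU_READ ? SHARED : EXCLUSIVE;
    case SHARED:
        if (op == CPU_READ && hit) return SHARED;  /* read hit: no bus traffic */
        if (op == CPU_READ) { puts("place read miss on bus"); return SHARED; }
        puts("place write miss on bus");           /* write hit or miss: invalidate others */
        return EXCLUSIVE;
    case EXCLUSIVE:
        if (hit) return EXCLUSIVE;                 /* read or write hit: no bus traffic */
        puts("write back block");                  /* conflict miss: evict the dirty block */
        puts(op == CPU_READ ? "place read miss on bus" : "place write miss on bus");
        return op == CPU_READ ? SHARED : EXCLUSIVE;
    }
    return state;
}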
Example
• The next slides trace this access sequence through the snooping protocol; each table row records P1's and P2's cache state, address, and value, the bus action (with processor, address, and value), and the memory contents. A1 and A2 map to the same cache block.
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
Example: Step 1
step               | P1 state/addr/value | P2 state/addr/value | Bus action/proc/addr/value | Memory addr/value
P1: Write 10 to A1 | Excl. A1 10         |                     | WrMs P1 A1                 |
P1: Read A1        |                     |                     |                            |
P2: Read A1        |                     |                     |                            |
P2: Write 20 to A1 |                     |                     |                            |
P2: Write 40 to A2 |                     |                     |                            |
Example: Step 2
step               | P1 state/addr/value | P2 state/addr/value | Bus action/proc/addr/value | Memory addr/value
P1: Write 10 to A1 | Excl. A1 10         |                     | WrMs P1 A1                 |
P1: Read A1        | Excl. A1 10         |                     |                            |
P2: Read A1        |                     |                     |                            |
P2: Write 20 to A1 |                     |                     |                            |
P2: Write 40 to A2 |                     |                     |                            |
Example: Step 3
step               | P1 state/addr/value | P2 state/addr/value | Bus action/proc/addr/value | Memory addr/value
P1: Write 10 to A1 | Excl. A1 10         |                     | WrMs P1 A1                 |
P1: Read A1        | Excl. A1 10         |                     |                            |
P2: Read A1        |                     | Shar. A1            | RdMs P2 A1                 |
                   | Shar. A1 10         |                     | WrBk P1 A1 10              | A1 10
                   |                     | Shar. A1 10         | RdDa P2 A1 10              | A1 10
P2: Write 20 to A1 |                     |                     |                            |
P2: Write 40 to A2 |                     |                     |                            |
Example: Step 4
step               | P1 state/addr/value | P2 state/addr/value | Bus action/proc/addr/value | Memory addr/value
P1: Write 10 to A1 | Excl. A1 10         |                     | WrMs P1 A1                 |
P1: Read A1        | Excl. A1 10         |                     |                            |
P2: Read A1        |                     | Shar. A1            | RdMs P2 A1                 |
                   | Shar. A1 10         |                     | WrBk P1 A1 10              | A1 10
                   |                     | Shar. A1 10         | RdDa P2 A1 10              | A1 10
P2: Write 20 to A1 | Inv.                | Excl. A1 20         | WrMs P2 A1                 | A1 10
P2: Write 40 to A2 |                     |                     |                            |
Example: Step 5
step               | P1 state/addr/value | P2 state/addr/value | Bus action/proc/addr/value | Memory addr/value
P1: Write 10 to A1 | Excl. A1 10         |                     | WrMs P1 A1                 |
P1: Read A1        | Excl. A1 10         |                     |                            |
P2: Read A1        |                     | Shar. A1            | RdMs P2 A1                 |
                   | Shar. A1 10         |                     | WrBk P1 A1 10              | A1 10
                   |                     | Shar. A1 10         | RdDa P2 A1 10              | A1 10
P2: Write 20 to A1 | Inv.                | Excl. A1 20         | WrMs P2 A1                 | A1 10
P2: Write 40 to A2 |                     |                     | WrMs P2 A2                 | A1 10
                   |                     | Excl. A2 40         | WrBk P2 A1 20              | A1 20
• Note: A2 maps to the same cache block as A1, so P2 writes A1 (value 20) back before loading A2 in Exclusive.
False Sharing
P1:
for (i = 0; ; i++) {
    ...
    *a = i;
    ...
}
P2:
for (j = 0; ; j++) {
    ...
    *(a + 1) = j;
    ...
}
• x1 and x2 (two words such as *a and *(a + 1) above) share a cache block, which starts in the shared state. What happens next? (The answers classify each access as a true or false sharing miss; a C sketch of false sharing follows the quiz.)
1 Write x1 ?
2 Read x2
3 Write x1
4 Write x2
5 Read x2
False Sharing Quiz
• x1 and x2 share a cache block, which starts in the shared state. What
happens next?
1 Write x1 True
2 Read x2 ?
3 Write x1
4 Write x2
5 Read x2
1 Write x1 True
2 Read x2 False
3 Write x1 ?
4 Write x2
5 Read x2
False Sharing Quiz
1 Write x1 True
2 Read x2 False
3 Write x1 False
4 Write x2 ?
5 Read x2
1 Write x1 True
2 Read x2 False
3 Write x1 False
4 Write x2 False
5 Read x2 ?
False Sharing Quiz
1 Write x1 True
2 Read x2 False
3 Write x1 False
4 Write x2 False
5 Read x2 True
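A minimal C sketch (my addition, not from the slides) of how false sharing shows up in practice; the thread library, iteration count, and 64-byte line size are illustrative assumptions. Build with: cc -O2 -pthread false_sharing.c

#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000L

/* counters[0] and counters[1] sit in the same cache block, so each thread's
 * writes invalidate the other thread's copy even though they touch
 * different words (false sharing). */
long counters[2];

/* Padding each counter out to its own 64-byte line (a common block size)
 * removes the false sharing:
 *   struct { long v; char pad[56]; } counters[2];   // then use counters[i].v */

void *worker(void *arg) {
    long idx = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[idx]++;
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counters[0], counters[1]);
    return 0;
}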
Larger MPs
Distributed Directory MPs
Directory Protocol
Directory-based cache coherence
• State machine for CPU requests, kept for each memory block; the cache states are Invalid, Shared (read only), and Exclusive (read/write)
• A block is in the Invalid state if it is only in memory
• Transitions on CPU requests:
– Invalid, CPU read: send Read Miss message to the home directory; go to Shared
– Invalid, CPU write: send Write Miss message to the home directory; go to Exclusive
– Shared, CPU read hit: no message
– Shared, CPU read miss: send Read Miss message to the home directory
– Shared, CPU write: send Write Miss message to the home directory; go to Exclusive
– Exclusive, CPU read hit or write hit: no message
– Exclusive, CPU read miss: send Data Write Back and Read Miss messages to the home directory; go to Shared
– Exclusive, CPU write miss: send Data Write Back and Write Miss messages to the home directory; stay in Exclusive
• Transitions on messages from the home directory:
– Shared, Invalidate: go to Invalid
– Exclusive, Fetch: send Data Write Back message to the home directory; go to Shared
– Exclusive, Fetch/Invalidate: send Data Write Back message to the home directory; go to Invalid
Directory State Machine
• State machine for the requests the directory receives for each memory block; the directory states are Invalid (uncached), Shared (read only), and Exclusive (read/write), and the directory tracks the set of Sharers. (A C sketch follows this slide.)
• Invalid (uncached):
– Read miss from P: Sharers = {P}; send Data Value Reply; go to Shared
– Write miss from P: Sharers = {P}; send Data Value Reply message; go to Exclusive
• Shared:
– Read miss from P: Sharers += {P}; send Data Value Reply
– Write miss from P: send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply message; go to Exclusive
• Exclusive:
– Read miss from P: Sharers += {P}; send Fetch to the owner (which writes the block back); send Data Value Reply message to the remote cache; go to Shared
– Write miss from P: Sharers = {P}; send Fetch/Invalidate to the owner (which writes the block back); send Data Value Reply message to the remote cache; stay in Exclusive
– Data Write Back: Sharers = {}; write the block back to memory; go to Invalid
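A minimal C sketch (my addition, not from the slides) of the per-block directory logic described above; messages are represented by prints and the sharer set by a bitmask with one bit per processor.

#include <stdio.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } DirState;
typedef enum { READ_MISS, WRITE_MISS, DATA_WRITE_BACK } DirMsg;

typedef struct { DirState state; unsigned sharers; } DirEntry;

/* Handle one message for one block arriving at the home directory from processor p. */
void directory_request(DirEntry *d, DirMsg msg, int p) {
    switch (d->state) {
    case DIR_UNCACHED:
        d->sharers = 1u << p;
        printf("send Data Value Reply to P%d\n", p);
        d->state = (msg == READ_MISS) ? DIR_SHARED : DIR_EXCLUSIVE;
        break;
    case DIR_SHARED:
        if (msg == READ_MISS) {
            d->sharers |= 1u << p;
            printf("send Data Value Reply to P%d\n", p);
        } else {                       /* WRITE_MISS */
            printf("send Invalidate to sharers 0x%x\n", d->sharers);
            d->sharers = 1u << p;
            printf("send Data Value Reply to P%d\n", p);
            d->state = DIR_EXCLUSIVE;
        }
        break;
    case DIR_EXCLUSIVE:
        if (msg == READ_MISS) {
            printf("send Fetch to owner; send Data Value Reply to P%d\n", p);
            d->sharers |= 1u << p;
            d->state = DIR_SHARED;
        } else if (msg == WRITE_MISS) {
            printf("send Fetch/Invalidate to owner; send Data Value Reply to P%d\n", p);
            d->sharers = 1u << p;      /* stays Exclusive with the new owner */
        } else {                       /* DATA_WRITE_BACK from the owner */
            d->sharers = 0;
            d->state = DIR_UNCACHED;   /* block written back to memory */
        }
        break;
    }
}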
Example
• The same access sequence is now traced through the directory protocol (P1: Read A1; P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2), showing each processor's cache state, the messages exchanged with the home directory, and the directory state and sharer set after every step.
• A1 and A2 map to the same cache block, so P2's write to A2 forces a write back of A1.
Snoopy-Cache State Machine-I
• The CPU-request state machine for each cache block, as on the earlier Snoopy-Cache State Machine slide: Invalid, Shared (read only), and Exclusive (read/write); read and write misses are placed on the bus, and a block is written back when it leaves the Exclusive state.
Memory Consistency
Memory Consistency Models
Sequential Consistency
• Implementations
– Delay the completion of a memory access until all invalidates it causes are done
– Or, delay the next memory access until the previous one has completed
How uniprocessor hardware optimizations of the memory system break sequential consistency + solution {No caches}
• Code example (Flag1 and Flag2 both start at 0):
1. P1: Flag1 = 1
2. P2: Flag2 = 1
3. P1: if (Flag2 == 0) ...
4. P2: if (Flag1 == 0) ...
• Animation: P1 issues its write of Flag1 (1) and P2 issues its write of Flag2 (2). Before either write reaches memory over the bus, P1 tests Flag2 (3) and P2 tests Flag1 (4); both still read 0, so both conditions are true, an outcome no sequentially consistent interleaving allows. (A C11 version of the flag example follows this slide.)
• Solution:
– Wait until a write has reached its memory module before allowing the next write (acknowledgments for writes), at the cost of more delay
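As a software-level footnote (my addition, not from the slides): C11 atomics with the default memory_order_seq_cst provide exactly the sequentially consistent behavior the example relies on, so at most one of the two conditions can be true; with relaxed ordering (or plain variables) both can be.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int Flag1, Flag2;                 /* both start at 0 */

void *p1(void *arg) {
    (void)arg;
    atomic_store(&Flag1, 1);             /* seq_cst store */
    if (atomic_load(&Flag2) == 0)        /* seq_cst load */
        puts("P1 sees Flag2 == 0");
    return NULL;
}

void *p2(void *arg) {
    (void)arg;
    atomic_store(&Flag2, 1);
    if (atomic_load(&Flag1) == 0)
        puts("P2 sees Flag1 == 0");
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Under seq_cst at most one message can print; with memory_order_relaxed
     * (or plain variables) both may print, the violation shown in the animation. */
    return 0;
}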
Animations of writes bypassing writes
• Code example (Data and Head both start at 0; Data lives in memory module M3 and Head in module M4):
1. P1: Data = 2000
2. P1: Head = 1
3. P2: while (Head == 0) ;
4. P2: ... = Data
• Animation: P1 issues the write of Data (1) and then the write of Head (2). The two writes travel to different memory modules, and the write of Head can reach M4 before the write of Data reaches M3. P2's test of Head (3) then sees 1 and exits the loop, and its read of Data (4) returns the old value: Data = 0!
How uniprocessor optimizations of the memory system break sequential consistency {No caches}
• Solution:
– Disable non-blocking reads – more delay (the next animation shows the reads-bypassing-reads problem this addresses)
• Code example (Data and Head both start at 0):
1. P1: Data = 2000
2. P1: Head = 1
3. P2: while (Head == 0) ;
4. P2: ... = Data
Animations of reads bypassing reads
• Animation: P1's writes of Data (1) and Head (2) are on their way to memory. P2's load of Head (3) misses and still returns Head = 0, but because reads are non-blocking P2's read of Data (4) is issued and completes early, returning the old value Data = 0. When Head later reads as 1 and the loop exits, P2 has already bound the stale value, again violating sequential consistency. (A C11 sketch of a correctly ordered Data/Head handoff follows this slide.)
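Another software-level footnote (my addition, not from the slides): on a machine with a weaker memory model, the Data/Head handoff needs explicit ordering. A C11 sketch using a release store and an acquire load on Head is one standard way to get it; the variable names mirror the example above.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int Data;              /* ordinary variable, published via Head */
atomic_int Head;       /* starts at 0 */

void *producer(void *arg) {
    (void)arg;
    Data = 2000;                                              /* 1. write the payload */
    atomic_store_explicit(&Head, 1, memory_order_release);    /* 2. publish: earlier writes
                                                                     may not move past this */
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&Head, memory_order_acquire) == 0)
        ;                                                     /* 3. spin until published;
                                                                     later reads stay below */
    printf("Data = %d\n", Data);                              /* 4. guaranteed to be 2000 */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}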
Now, let's add caches... More problems
• (WAW) P1's write to Data reaches memory, but has not yet been propagated to P2: the cache coherence protocol's message (invalidate or update) has not yet arrived, so P2's cached copy of Data is still 0 while memory (M3) already holds Data = 2000.
• Code Example:
1. P1: Data = 2000
2. P1: Head = 1
3. P2: while (Head == 0) ;
4. P2: ... = Data
Animations of problems with a cache
• Animation: P1 writes Data = 2000 (1) and Head = 1 (2); both reach memory (M3: Data = 2000, M4: Head = 1), but the invalidate for Data has not yet reached P2, whose cache still holds Data = 0.
• P2's test of Head (3) sees 1 and exits the loop; P2 then reads Data (4) from its own cache and gets the stale value: Data = 0!
Animations of problems with networks
• Memory is distributed across the network: module M1 holds A, module M2 holds B and C; P1–P4 reach the modules through the interconnection network.
• P1 writes A = 1 (1) and P2 writes A = 2 (2). The two writes, and the updates they trigger, travel along different network paths with different delays, so P3 ends up with Register1 = A = 1 while P4 ends up with Register2 = A = 2: different processors see the writes to the same location in different orders.
• Sequential consistency is slow, because
– In each processor, the previous memory operation in program order must complete before the next memory operation proceeds
– In cache-coherent systems,
• writes to the same location must become visible in the same order to all processors; and
• no subsequent reads may proceed until the write is visible to all processors
• Mitigating solutions
– Prefetch cache ownership while the write waits in the write buffer
– Speculate and roll back
– Compiler memory analysis – reorder in hardware/software only when it is safe
Relaxed (Weak) Consistency Models
Synchronization
Load Linked, Store Conditional
• Hard to have a read and a write in one instruction; use two instead
• Load linked (or load locked) + store conditional
– Load linked returns the initial value
– Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load linked) and 0 otherwise
• Example doing atomic swap with LL & SC:
try: mov R3,R4 ; mov exchange value
ll R2,0(R1) ; load linked
sc R3,0(R1) ; store conditional
beqz R3,try ; branch if store fails (R3 = 0)
mov R4,R2 ; put load value in R4
• Example doing fetch & increment with LL & SC (a C11 analogue follows):
try: ll R2,0(R1) ; load linked
addi R2,R2,#1 ; increment (OK if reg–reg)
sc R2,0(R1) ; store conditional
beqz R2,try ; branch if store fails (R2 = 0)
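For comparison (my addition, not from the slides): the same two primitives written with C11 atomics; on LL/SC machines such as MIPS, RISC-V, or older ARM, compilers typically emit a load-linked/store-conditional retry loop much like the assembly above.

#include <stdatomic.h>

/* Atomic exchange: store new_val into *loc and return the old value. */
int atomic_swap(atomic_int *loc, int new_val) {
    return atomic_exchange(loc, new_val);
}

/* Fetch-and-increment: add 1 to *loc and return the value before the add. */
int fetch_and_increment(atomic_int *loc) {
    return atomic_fetch_add(loc, 1);
}

/* The ll/sc retry pattern made explicit with compare-and-swap:
 * re-read, compute, and retry if another store slipped in between. */
int fetch_and_increment_cas(atomic_int *loc) {
    int old = atomic_load(loc);
    while (!atomic_compare_exchange_weak(loc, &old, old + 1))
        ;   /* on failure, old is reloaded with the current value */
    return old;
}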
User-Level Synchronization
Scalable Synchronization
• Exponential backoff (a sketch follows this list)
– If you fail to get the lock, don't try again for an exponentially increasing amount of time
• Queuing lock
– Wait for lock by getting on the queue. Lock is passed down the
queue.
– Can be implemented in hardware or software
• Combining tree
– Implement barriers as a tree to avoid contention
• Powerful atomic primitives such as fetch-and-increment
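A minimal sketch (my addition, not from the slides) of a test-and-set spin lock with exponential backoff, as described above; the 1-microsecond starting delay, 64-microsecond cap, and use of nanosleep are illustrative choices.

#include <stdatomic.h>
#include <time.h>

typedef struct { atomic_flag locked; } backoff_spinlock;
backoff_spinlock lock_var = { ATOMIC_FLAG_INIT };

void backoff_lock(backoff_spinlock *l) {
    long delay_ns = 1000;                                   /* start at 1 microsecond */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        struct timespec ts = { 0, delay_ns };
        nanosleep(&ts, NULL);                               /* back off instead of spinning */
        if (delay_ns < 64000)
            delay_ns *= 2;                                  /* exponential growth, capped */
    }
}

void backoff_unlock(backoff_spinlock *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}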
Summary
IBM Power4
Interconnection network of buses
• L1 cache: write-through
• L2 cache: write-back, inclusive of L1, maintains coherence (invalidation-based)
• L3 cache: write-back, non-inclusive, maintains coherence (invalidation-based)
• Snooping protocol
– Snoops on all buses
IBM Power4’s L2 coherence protocol
• I (Invalid)
• SL (Shared, cached in one core within the module)
• S (Shared, cached in multiple cores within the module)
• M (Modified, and exclusive)
• Me (Exclusive but not modified; no write back necessary)
• Mu (used by load-linked lwarx / store-conditional stwcx sequences)
• T (Tagged: data modified, but no longer exclusively owned)
– Optimization: do not write modified data back to memory when another core's read hits the block
IBM Power4’s L3 coherence protocol
• States:
• Invalid
• Shared: can supply data only to the L2s it is caching data for
• Tagged: modified and shared (data homed in local memory)
• Tagged remote: modified and shared (data homed in remote memory)
• O (prefetch): data in L3 matches memory, fetched from the same node, not modified
Node overview
Single chip