0% found this document useful (0 votes)

95 views52 pages

Memory in Multiprocessor System

This document discusses parallel computing and multiprocessors. It covers topics such as: - Definitions of parallel computers and questions about their design. - Different levels of parallelism including bit-level, instruction-level, and process-level parallelism. - Reasons for using multiprocessors like leveraging microprocessors and addressing limits of uniprocessor performance. - Popular models of parallelism including SISD, SIMD, MISD, and MIMD and examples of each. - Communication mechanisms in parallel systems like shared memory, message passing, and data parallel programming.

Uploaded by

Priyanka Madaan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

95 views52 pages

Memory in Multiprocessor System

Uploaded by

Priyanka Madaan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 52

Multiprocessors: Basics, Cache coherence,

Synchronization, and Memory consistency

Z. Jerry Shi
Assistant Professor of Computer Science and Engineering
University of Connecticut

* Slides adapted from Blumrich&Gschwind/ELE475’03, Peh/ELE475’*

Parallel Computers

• Definition: “A parallel computer is a collection of processing elements that

cooperate and communicate to solve large problems fast.”
– Almasi and Gottlieb, Highly Parallel Computing ,1989
• Questions about parallel computers:
– How large a collection?
– How powerful are processing elements?
– How do they cooperate and communicate?
– How are data transmitted?
– What type of interconnection?
– What are HW and SW primitives for programmer?
– Does it translate into performance?

1
Parallel Processors “Religion”

• The dream of computer architects since 1950s: replicate

processors to add performance vs. design a faster processor
• Led to innovative organization tied to particular programming
models since “uniprocessors can’t keep going”
– e.g., uniprocessors must stop getting faster due to limit of speed of
light: 1972, … , 1989, NOW
– Borders religious fervor: you must believe!
– Fervor damped somewhat when 1990s companies went out of
business: Thinking Machines, Kendall Square, ...
• Arguments: “pull” of opportunity of scalable performance; the
“push” of uniprocessor performance plateau?

What Level Parallelism?

• Bit level parallelism: 1970 to ~1985

– 4 bits, 8 bit, 16 bit, 32 bit microprocessors
• Instruction level parallelism (ILP):
~1985 through today
– Pipelining
– Superscalar
– VLIW
– Out-of-Order execution
– Limits to benefits of ILP?
• Process Level or Thread level parallelism; mainstream for general purpose
computing?
– Servers are parallel
– Many PCs too!

2
Why Multiprocessors?

• Microprocessors are the fastest CPUs – leverage them

– Collecting several much easier than redesigning one
• Complexity of current microprocessors
– Do we have enough ideas to sustain 1.5X/yr?
– Can we deliver such complexity on schedule?
• Slow (but steady) improvement in parallel software (scientific
apps, databases, OS)
• Emergence of embedded and server markets driving
microprocessors in addition to desktops
– Embedded functional parallelism, producer/consumer model
– Server figure of merit is tasks per hour vs. latency
• Long term goal: scale number of processors to size of budget and
desired performance

Parallel Architecture

• Parallel Architecture extends traditional computer architecture

with a communication architecture
– Abstractions (HW/SW interface)
– Organizational structure to realize abstraction efficiently

3
Performance Metrics: Latency and Bandwidth

• Bandwidth
– Need high bandwidth in communication
– Match limits in network, memory, and processor
– Challenge is link speed of network interface vs. bisection bandwidth of
network
• Latency
– Affects performance, since processor may have to wait
– Affects ease of programming, since it requires more thought to overlap
communication and computation
– Overhead to communicate is a problem in many machines
• Latency Hiding
– How can a mechanism help hide latency?
– Increases programming system burden
– Examples: overlap message send with computation, prefetch data, switch
to other tasks

Popular Flynn Categories (1966)

• SISD (Single Instruction Stream, Single Data Stream)

– Uniprocessors
• SIMD (Single Instruction Stream, Multiple Data Stream)
– Sing instruction memory and control processor (e.g., Illiac-IV, CM-2,
vector machines)
• Simple programming model
• Low overhead
• Requires lots of data parallelism
• All custom integrated circuits
• MISD (Multiple Instruction Stream, Single Data Stream)
– No commercial multiprocessor of this type has been built
• MIMD (Multiple Instruction Stream, Multiple Data Stream)
– Examples: IBM SP, Cray T3D, SGI Origin, your PC?
• Flexible
• Use off-the-shelf microprocessors
• MIMD is the current winner: Concentrate on major design emphasis <= 128
processor MIMD machines

4
MIMD: Centralized Shared Memory

• UMA: Uniform Memory Access (latency)

• Symmetric (shared-memory) Multi-Processor (SMP)

MIMD: Distributed Memory

• Advantages: More memory bandwidth (low cost), lower memory latency (local)
• Drawback: Longer communication latency, More complex software model

5
Distributed Memory Versions

• Distributed Shared Memory (DSM)

– Single physical address space with load/store access
• Address space is shared among nodes
– Also called Non-Uniform Memory Access (NUMA)
• Access time depends on the location of the data word in memory
• Message-Passing Multicomputer
– Separate address space per processor
• The same address on different nodes refers to different words
– Could be completely separated computers connected on LAN
• Also called clusters
– Tightly-coupled cluster

Layers of Parallel Framework

• Programming Model:
– Multiprogramming : lots of jobs, no communications
– Shared memory: communicate via memory
– Message passing: send and receive messages
– Data Parallel: several agents operate on several data sets
simultaneously and then exchange information globally and
simultaneously (shared or message passing)
• Communication Abstraction:
– Shared address space, e.g., load, store, atomic swap
– Message passing, e.g., send, receive library calls
– Debate over this topic (ease of programming, scaling) => many
hardware designs 1:1 programming model

6
Data Parallel Model

• Operations can be performed in parallel on each element of a

large regular data structure, such as an array
• 1 Control Processor broadcast to many PEs
– When computers were large, could amortize the control portion of
many replicated PEs
• Condition flag per PE so that PE can be skipped
• Data distributed in each memory
• Early 1980s VLSI => SIMD rebirth: 32 1-bit procs + memory on
a chip was the PE
• Data parallel programming languages lay out data to processors

Communication Mechanisms

• Shared memory: Memory load and store

– Oldest, but most popular model
• Message passing: Send and receive messages
– Sending messages that request action or deliver data
– Remote procedure call (RPC)
• Request the data and/or operations performed on data
• Synchronous: Request, wait, and continue
• Three performance metrics
– Communication bandwidth
– Communication latency
– Communication latency hiding
• Overlap communication with computation or with other
communication

7
Shared Memory Model

• Communicate via Load and Store

• Based on timesharing: processes on multiple processors vs.
sharing single processor
• Process: a virtual address space and about one thread of control
– Multiple processes can overlap (share), but ALL threads share a
process address space
– Threads may be used in a casual way to refer to multiple loci of
execution, even when they do not share an address space
• Writes to shared address space by one thread are visible to reads
of other threads
– Usual model: share code, private stack, some shared heap, some
private heap

Message Passing Model

• Whole computers (CPU, memory, I/O devices) communicate as

explicit I/O operations
• Send specifies local buffer + receiving process on remote
computer
• Receive specifies sending process on remote computer + local
buffer to place data
– Usually send includes process tag and receive has rule on tag (i.e.
match 1, match any)
– Synchronize: When send completes? When buffer free? When
request accepted? Receive wait for send?
• Send+receive => memory-memory copy, where each supplies
local address, AND does pairwise sychronization!

8
Advantages of Shared-Memory Communication

• Compatibility
– Oldest, but the most popular model
• Ease of programming
– Communication patterns among processors are complex and
dynamic
– Simplify compilers
• Use familiar shared-memory model to develop applications
– Attention only on performance critical accesses
• Lower overhead for communications and better use of bandwidth
when communicating small items
• Can use hardware-controlled caching to reduce the frequency of
remote communication

Advantages of Message-Passing Communication

• Hardware is simpler
• Communication is explicit
– It is easier to understand (when, cost, etc.)
– In shared memory it can be hard to know when communicating and
when not, and how costly it is
• Force programmer to focus on communication, the costly aspects
of parallel computing
• Synchronization is naturally associated with sending messages
– Reducing the possibility for errors introduced by incorrect
synchronization
• Easier to use sender-initiated communication
– May have some advantages in performance

9
Communication Options

• Supporting message passing on top of shared memory

– Sending a message == Copying data
• Data may be misaligned though
• Supporting shared memory on top of hardware for message
passing
– The cost of communication is too expensive
• Loads/stores usually move small amounts of data
– Shared virtual memory may be a possible direction
• What’s the best choice for massively parallel processors (MPP)?
– Message passing is clearly simpler
– Shared-memory has been widely supported since 1995

Amdahl’s Law and Parallel Computers

• Amdahl’s Law:
1
Speedup =
FracX
( + (1 − FracX ))
SpeedupX
• The sequential portion limits parallel speedup
– Speedup <= 1/ (1 - FracX)
• The large latency of remote access is another major challenge

Example: What fraction sequential to get 80x speedup from 100 processors?
Assume either 1 processor or 100 fully used.
80 = 1 / ((FracX/100 + (1-FracX))
0.8 · FracX + 80 · (1 - FracX) = 80 - 79.2 · FracX = 1
FracX = (80-1) / 79.2 = 0.9975
Only 0.25% sequential!

10
But In The Real World...

• Wood and Hill [1995] compared cost-effectiveness of

multiprocessor systems
– Challenge DM (1 proc, 6GB memory)
• costDM(1, MB) = $38,400 + $100 × MB
• costDM(1, 1GB) = $140,800
– Challenge XL (32 proc)
• costXL(p, MB) = $81,600 + $20,000 × p + $100 × MB
• costXL(8, 1GB) = $344,000
= 2.5x costDM(1,1GB)
– So, XL speedup of 2.5x or more makes the XL more cost effective!

Memory System Coherence

• Cache coherence challenge for hardware

• Memory system is coherent if any read of a data item returns the
most recently written value of that data item
– Coherence: What value can be returned by a read
• Defines the behavior of reads and writes with respect to the same
memory location
• Ensures the memory system is correct.
– Consistency: When written values will be returned by reads (all
locations)
• Defines the behavior of accesses with respect to accesses to other
memory locations

11
Multiprocessor Coherence Rules

• A memory system is coherent if

• Processor P writes X
... No other writes to X
Processor P reads X
Value must be the same
• Processor Q writes X
... Sufficient time
... No other writes to X
Processor P reads X
Value must be the same
• Processor P writes X
Processor Q writes X
All processors see it as P, Q or Q, P, but the order should be the same for
all processors
Wirtes to the same location are serialized

Potential HW Coherency Solutions

• Snooping Solution (Snoopy Bus):

– Send all requests for data to all processors
– Processors snoop to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for small scale machines (most of the market)
• Directory-Based Schemes (discussed later)
– Keep track of what is being shared in a centralized place (logically)
– Distributed memory => distributed directory for scalability
(avoids bottlenecks)
– Send point-to-point requests to processors via network
– Scales better than Snooping
– Actually existed BEFORE Snooping-based schemes

12
Bus Snooping Topology

• Memory: centralized with uniform access time (“UMA”) and bus interconnect
• Symmetric Multiprocessor (SMP)

Basic Snooping Protocols

• Write Invalidate Protocol:

– Multiple readers, single writer
– Write to shared data: an invalidate is sent to all caches which
snoop and invalidate any copies
– Read Miss:
• Write-through: memory is always up-to-date
• Write-back: snoop in caches to find most recent copy
• Write Broadcast/Update Protocol (typically write through):
– Write to shared data: broadcast on bus, processors snoop, and
update any copies
– Read miss: memory is always up-to-date
• Write serialization: bus serializes requests!
– Bus is single point of arbitration

13
Basic Snooping Protocols

• Write Invalidate versus Broadcast:

– Invalidate requires one transaction per write-run
– Invalidate uses spatial locality
• Invalidate a block each time
– Broadcast has lower latency between write and read

An Example Snooping Protocol

• Invalidation protocol, write-back cache

• Each block of memory is in one state:
– Clean in all caches and up-to-date in memory (Shared)
– OR Dirty in exactly one cache (Exclusive)
– OR Not in any caches (Invalid)
• Each cache block is in one state (track these):
– Shared : block can be read
– OR Exclusive : cache has only copy, its writeable, and dirty
– OR Invalid : block contains no data
• Read misses: cause all caches to snoop bus
• Writes to clean line are treated as misses

14
Snoopy-Cache State Machine-I
CPU Read hit
• State machine
for CPU requests CPU Read Shared
for each Invalid (read/only)
cache block Place read miss
on bus

CPU Write
Place Write CPU read miss CPU Read miss
Miss on bus Write back block, Place read miss
Place read miss on bus
on bus
CPU Write
Cache Block Place Write Miss on Bus
State Exclusive
(read/write)
CPU read hit CPU Write Miss
CPU write hit Write back cache block
Place write miss on bus

Snoopy-Cache State Machine-II

• State machine Write miss

for bus requests Invalid for this block Shared
for each (read/only)
cache block

Write miss
for this block Read miss
Write Back Read miss for this block
Block; (abort for this block
memory access) Write Back
Block; (abort
Exclusive memory access)
(read/write)

15
Example
Processor 1 Processor 2 Bus Memory
P1 P2 Bus Memory
step State Addr Value State Addr Value Action Proc. Addr Value Addr Value
P1: Write 10 to A1
P1: Read A1
P2: Read A1

P2: Write 20 to A1
P2: Write 40 to A2

Assumes initial cache state Remote Write CPU Read hit

is invalid and A1 and A2 map
Invalid Shared CPU Read Miss
to same cache block,
Read
but A1 != A2 miss on bus
Write
Remote miss on bus CPU Write
Write Remote Read Place Write
Write Back Write Back Miss on Bus