Parallel Architecture Fundamentals

[Figure (Bill Dally): two log-scale charts of processor performance (ps/Inst) and related quantities (gate delay, gates/clock, clocks/inst) vs. year, 1980–2020. Performance improved at 52%/year while the underlying quantities improve at roughly 19%/year; annotated gaps of 30:1, 1,000:1, and 30,000:1 mark the divergence between the two trajectories.]
Application demands: Our insatiable need for cycles
• Scientific computing: CFD, Biology, Chemistry, Physics, ...
• General-purpose computing: Video, Graphics, CAD, Databases, TP...
Technology Trends
• Number of transistors on chip growing rapidly
• Clock rates expected to go up only slowly
Economics

Demand for cycles fuels advances in hardware, and vice versa
• Drives exponential increase in microprocessor performance
• Drives parallel architecture harder: most demanding applications
Range of performance demands
• Need range of system performance with progressively increasing cost
• Platform pyramid

For a fixed problem size (input data set), performance = 1/time
• Helps build intuition about design issues of parallel machines
• Shows fundamental role of parallelism even in “sequential” computers

[Figure: performance of Mainframes, Minicomputers, and Microprocessors, 1965–1995, log scale]
Commodity microprocessors have caught up with supercomputers.

Four generations of architectural history: tube, transistor, IC, VLSI
• Here focus only on VLSI generation
Greatest delineation in VLSI has been in type of parallelism exploited
Arch. Trends: Exploiting Parallelism
• Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
– great inflection point when 32-bit micro and cache fit on a chip
• Mid 80s to mid 90s: instruction-level parallelism

Phases in VLSI Generation
[Figure: transistors per chip vs. year, log scale, with phases marked bit-level parallelism, instruction-level parallelism, and thread-level parallelism (?); labeled designs include i80286, i80386, R2000, R3000, R10000]

Reported speedups for superscalar processors:
• Murakami et al. [1989] ........................................ 2.55
• Chang et al. [1991] ............................................. 2.90
History
Historically, parallel architectures tied to programming models
• Divergent architectures, with no predictable pattern of growth.
[Figure: divergent approaches (Systolic Arrays, SIMD, Message Passing, Dataflow, Shared Memory) surrounding the Application Software / System Software / Architecture stack]
Historically, machines tailored to programming models
• Programming model, communication abstraction, and machine organization lumped together as the “architecture”
Evolution helps understand convergence
• Identify core concepts
Most Common Models:
• Shared Address Space, Message Passing, Data Parallel
Other Models:
• Dataflow, Systolic Arrays
Examine programming model, motivation, intended applications, and contributions to convergence

Today
Extension of “computer architecture” to support communication and cooperation
• OLD: Instruction Set Architecture
• NEW: Communication Architecture
Defines
• Critical abstractions, boundaries, and primitives (interfaces)
• Organizational structures that implement interfaces (hw or sw)
Compilers, libraries and OS are important bridges today

Shared Address Space Architectures
Any processor can directly reference any memory location
• Communication occurs implicitly as result of loads and stores
Convenient:
• Location transparency
• Similar programming model to time-sharing on uniprocessors
– Except processes run on different processors
– Good throughput on multiprogrammed workloads
Naturally provided on wide range of platforms
• History dates at least to precursors of mainframes in early 60s
• Wide range of scale: few to hundreds of processors
Popularly known as shared memory machines or model
• Ambiguous: memory may be physically distributed among processors
[Figure: the shared portion and per-process private portions (P0 private, P1 private, P2 private) of the virtual address spaces map onto common physical addresses; a store by P2 into the shared portion is shown. Machine organization: processors, memories (Mem), and I/O controllers attached to an interconnect]
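The loads-and-stores model above is easy to demonstrate with threads, which share one address space by construction. A minimal sketch (POSIX threads; the names producer and shared_result are invented for illustration):

```c
/* Minimal sketch of the shared-address-space model: the two threads
   communicate only through ordinary loads and stores to a variable
   both can name. No explicit communication operation appears. */
#include <pthread.h>
#include <stdio.h>

static int shared_result;          /* lives in the shared portion of the address space */

static void *producer(void *arg) {
    (void)arg;
    int local = 21;                /* "private" data: only this thread names it */
    shared_result = 2 * local;     /* an ordinary store communicates the value */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);         /* join orders the store before our load */
    printf("consumer loads %d\n", shared_result);  /* an ordinary load receives it */
    return 0;
}
```

The value moves between threads with no send or receive; the join is only there to order the store before the load.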
• Originally processor cost limited to small scale

[Figure: Intel Pentium Pro Quad: four processor modules with caches on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); memory interface unit (MIU) to 1-, 2-, or 4-way interleaved DRAM; bridges to PCI buses and PCI I/O cards]

[Figure: Sun Enterprise: processor + memory cards and I/O cards on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz); each card has a bus interface/switch; I/O cards carry SBUS slots, FiberChannel, 100bT, SCSI]
• 16 cards of either type: processors + memory, or I/O
• All memory accessed over bus, so symmetric
• Higher bandwidth, higher latency bus

Scaling Up
[Figure: “dance hall” organization (all processors on one side of the interconnect, all memories on the other) vs. distributed memory (a memory local to each processor)]
• Problem is interconnect: cost (crossbar) or bandwidth (bus)
• Dance-hall: bandwidth still scalable, but lower cost than crossbar
– latencies to memory uniform, but uniformly large
• Distributed memory or non-uniform memory access (NUMA)
– Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response); a toy sketch follows this list
• Caching shared (particularly nonlocal) data?
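A toy illustration of that read-request/read-response idea: two threads stand in for a requesting node and a home node, and a one-slot mailbox stands in for the general-purpose network. All names and the protocol framing here are invented for the sketch; real NUMA hardware does this in the communication assist.

```c
/* Toy software shared-address-space read built from two message
   transactions: a read-request carrying an address, and a
   read-response carrying the data. */
#include <pthread.h>
#include <stdio.h>

typedef struct {                 /* one-slot mailbox = toy network channel */
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int full, value;
} mailbox;

static void mb_send(mailbox *mb, int v) {
    pthread_mutex_lock(&mb->m);
    while (mb->full) pthread_cond_wait(&mb->cv, &mb->m);
    mb->value = v; mb->full = 1;
    pthread_cond_broadcast(&mb->cv);
    pthread_mutex_unlock(&mb->m);
}

static int mb_recv(mailbox *mb) {
    pthread_mutex_lock(&mb->m);
    while (!mb->full) pthread_cond_wait(&mb->cv, &mb->m);
    int v = mb->value; mb->full = 0;
    pthread_cond_broadcast(&mb->cv);
    pthread_mutex_unlock(&mb->m);
    return v;
}

static mailbox req  = {PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0};
static mailbox resp = {PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0};
static int memory[4] = {10, 20, 30, 40};   /* memory local to the home node */

static void *home_node(void *arg) {
    (void)arg;
    int addr = mb_recv(&req);        /* read-request carries an address */
    mb_send(&resp, memory[addr]);    /* read-response carries the data  */
    return NULL;
}

static int remote_read(int addr) {   /* what the assist does under a remote load */
    mb_send(&req, addr);
    return mb_recv(&resp);
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, home_node, NULL);
    printf("remote_read(2) = %d\n", remote_read(2));  /* prints 30 */
    pthread_join(t, NULL);
    return 0;
}
```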
Message-passing abstraction:
[Figure: a matched pair of operations: Send X, Q, t transmits the data at local address X to process Q with match tag t; Receive Y, P, t copies the matching message from process P into local address Y; each process names only its own local address space]

Evolution of message-passing machines:
• Early machines: hardware close to programming model
– synchronous ops
• Replaced by DMA, enabling non-blocking ops
– Buffered by system at destination until recv
[Figure: early hypercube machine, nodes numbered 000, 001, ..., 110, 111]
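The Send X, Q, t / Receive Y, P, t pairing maps directly onto a modern message-passing library. A sketch in MPI (an anachronism here, since MPI postdates these early machines, but it implements exactly this matching rule; the buffer names and tag value are illustrative):

```c
/* Local buffers x and y play the roles of addresses X and Y; ranks
   play the role of process names; TAG is the match tag t. */
#include <mpi.h>
#include <stdio.h>

#define TAG 7

int main(int argc, char **argv) {
    int rank, x = 42, y = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send X, Q, t : data at local address &x goes to process 1, tag TAG */
        MPI_Send(&x, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive Y, P, t : matches on sender 0 and tag TAG, copies into &y */
        MPI_Recv(&y, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d into y\n", y);
    }
    MPI_Finalize();
    return 0;
}
```

Run with mpirun -np 2; the receive matches on both the source rank and the tag, just as in the figure.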
[Figure: IBM SP-2 node: Power 2 CPU, DRAM, I/O, DMA, and an i860-based NI]
[Figure: 2D grid network with a processing node attached to every switch; links 8 bits wide, 175 MHz, bidirectional]
Toward architectural convergence:
• Nodes connected by general network and communication assists
• Implementations also converging, at least in high-end machines

Data parallel systems, original motivation:
• Matches simple differential equation solvers
• Centralize high cost of instruction fetch & sequencing
[Figure: a grid of processing elements (PE), each connected to its neighbors]
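A sketch of the data-parallel style this motivates: one operation applied across a whole array, here a Jacobi-style relaxation step for a 1D diffusion problem. The problem, sizes, and OpenMP directive are illustrative; a real SIMD machine would broadcast the instruction to all PEs instead.

```c
/* Each iteration applies the same statement to every grid point,
   which is exactly the per-PE operation of the slide's PE array. */
#include <stdio.h>

#define N 16

int main(void) {
    double u[N + 2] = {0}, unew[N + 2];
    u[0] = 1.0;                           /* boundary condition */

    for (int step = 0; step < 100; step++) {
        /* conceptually, "PE i" executes this statement on its element */
        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        for (int i = 1; i <= N; i++)
            u[i] = unew[i];
    }
    printf("u[1] = %f\n", u[1]);
    return 0;
}
```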
[Figure: dataflow processor organization: a token queue feeds a waiting-matching store; matched tokens proceed through instruction fetch, execute, and form-token stages, and new tokens return via the network to the token queue]
Lasting contributions:
• Integration of communication with thread (handler) generation
• Tightly integrated communication and fine-grained synchronization
• Remained useful concept for software (compilers etc.)
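A toy sketch of the firing rule in that pipeline (the graph r = (a + b) * (c - d), the structs, and the queue are all invented for illustration): an instruction fires only when all of its operand tokens have arrived, and firing forms new tokens for its successors.

```c
#include <stdio.h>

typedef struct { int dest; int port; double val; } Token;

typedef struct {
    char   op;             /* '+', '-', '*' */
    double operand[2];
    int    arrived;        /* operand tokens matched so far */
    int    succ, succ_port; /* where the result token goes (-1 = final) */
} Instr;

static Instr prog[3] = {
    {'+', {0, 0}, 0,  2, 0},   /* instr 0: a+b -> instr 2, port 0 */
    {'-', {0, 0}, 0,  2, 1},   /* instr 1: c-d -> instr 2, port 1 */
    {'*', {0, 0}, 0, -1, 0},   /* instr 2: final result */
};

static Token queue[16];
static int head = 0, tail = 0;

static void enqueue(Token t) { queue[tail++] = t; }

int main(void) {
    /* initial input tokens: a=1, b=2, c=7, d=3 */
    enqueue((Token){0, 0, 1.0}); enqueue((Token){0, 1, 2.0});
    enqueue((Token){1, 0, 7.0}); enqueue((Token){1, 1, 3.0});

    while (head < tail) {                    /* token queue */
        Token t = queue[head++];
        Instr *i = &prog[t.dest];
        i->operand[t.port] = t.val;          /* waiting-matching */
        if (++i->arrived < 2) continue;      /* cannot fire yet */
        double r = (i->op == '+') ? i->operand[0] + i->operand[1]
                 : (i->op == '-') ? i->operand[0] - i->operand[1]
                 :                  i->operand[0] * i->operand[1];
        if (i->succ < 0) printf("result token: %g\n", r);  /* (1+2)*(7-3)=12 */
        else enqueue((Token){i->succ, i->succ_port, r});   /* form token */
    }
    return 0;
}
```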
[Figure: generic parallel architecture: repeated nodes, each with memory (Mem), cache ($), processor, and a communication assist (CA), connected by a scalable network]

Fundamental Design Issues
Traditional taxonomies not very useful
Programming models not enough, nor hardware structures
• Same one can be supported by radically different architectures
Architectural distinctions that affect software
• Compilers, libraries, programs
Design of user/system and hardware/software interface
• Constrained from above by progr. models and below by technology
Guiding principles provided by layers
• What primitives are provided at communication abstraction
• How programming models map to these
• How they are mapped to hardware

At any layer, interface (contract) aspect and performance aspects
• Naming: How are logically shared data and/or processes referenced?
• Operations: What operations are provided on these data
• Ordering: How are accesses to data ordered and coordinated?
• Replication: How are data replicated to reduce communication?
• Communication Cost: Latency, bandwidth, overhead, occupancy
Understand at programming model first, since that sets requirements
Other issues:
• Node Granularity: How to split between processors and memory?
• ...
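Those cost components are often combined into a simple linear model. A hedged sketch (this decomposition follows the standard textbook treatment rather than anything spelled out on the slide; occupancy appears as the n/B term):

```latex
% Per-message time for an n-byte message (B = bandwidth of the
% bottleneck resource):
T_{\mathrm{msg}}(n) = \mathrm{Overhead} + \mathrm{Delay} + \frac{n}{B}
% Total cost scales with communication frequency, less whatever is
% overlapped with useful computation:
\mathrm{CommCost} = \mathrm{Frequency} \times \left( T_{\mathrm{msg}} - \mathrm{Overlap} \right)
```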
Synchronization
Mutual exclusion (locks)
• Ensure certain operations on certain data can be performed by only one process at a time
• Room that only one person can enter at a time
• No ordering guarantees
Event synchronization
• Ordering of events to preserve dependences
– e.g. producer -> consumer of data
• 3 main types:
– point-to-point
– global
– group
(Both kinds appear in the sketch at the end of this section.)

Message Passing
Naming: Processes can name private data directly.
• No shared address space
Operations: Explicit communication via send and receive
• Send transfers data from private address space to another process
• Receive copies data from process to private address space
• Must be able to name processes
Ordering:
• Program order within a process
• Send and receive can provide pt-to-pt synch between processes
• Mutual exclusion inherent
Can construct global address space:
• Process number + address within process address space
• But no direct operations on these names
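A minimal sketch of both kinds of synchronization from the slide above (POSIX threads; the names and the value 42 are invented): the mutex is the room only one thread may enter, and the condition variable provides point-to-point producer -> consumer event synchronization.

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;  /* the "room" */
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int data, produced = 0;

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);      /* mutual exclusion: one thread at a time */
    data = 42;                      /* produce */
    produced = 1;
    pthread_cond_signal(&ready);    /* event: tell the consumer data exists */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_mutex_lock(&lock);
    while (!produced)               /* preserve producer -> consumer dependence */
        pthread_cond_wait(&ready, &lock);
    printf("consumed %d\n", data);
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}
```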