Lect 11-12
PARALLEL COMPUTATION
1. Why Parallel Computation
2. Parallel Programs
3. A Classification of Computer Architectures
4. Performance of Parallel Architectures
5. The Interconnection Network
6. SIMD Computers: Array Processors
7. MIMD Computers
8. Multicore Architectures
9. Multithreading
10. General Purpose Graphic Processing Units
11. Vector Processors
12. Multimedia Extensions to Microprocessors
The Need for High Performance
Two main factors contribute to high performance of modern processors:
Fast circuit technology
Architectural features:
- large caches
- multiple fast buses
- pipelining
- superscalar architectures (multiple functional units)
- very long instruction word (VLIW) architectures
However, computers with a single CPU are often unable to meet the
performance needs in certain areas:
- Fluid flow analysis and aerodynamics
- Simulation of large complex systems, for example in physics,
economics, biology, engineering
- Computer aided design
- Multimedia
- Machine learning
A Solution: Parallel Computers
Such computers have been organized in different ways. Some key features:
number and complexity of individual CPUs
availability of common (shared) memory
interconnection topology
performance of interconnection network
I/O devices
[Figure: a conventional single-processor computer — one CPU connected to memory by a single data stream.]
Flynn’s Classification of Computer Architectures
Single Instruction stream, Multiple Data stream (SIMD)
[Figure: SIMD with shared memory — the control unit broadcasts one instruction stream (IS) to processing units PU_1 … PU_n; each PU_i operates on its own data stream DS_i and accesses shared memory through an interconnection network.]
Flynn’s Classification of Computer Architectures
Single Instruction stream, Multiple Data stream (SIMD)
[Figure: SIMD with local memory — the control unit broadcasts one instruction stream (IS) to processing units PU_1 … PU_n; each PU_i operates on data stream DS_i out of its own local memory LM_i and communicates through an interconnection network.]
Flynn’s Classification of Computer Architectures
Multiple Instruction stream, Multiple Data stream (MIMD)
[Figure: MIMD with shared memory — CPUs CPU_1 … CPU_n, each consisting of a control unit (instruction stream IS_i) and a processing unit (data stream DS_i), access shared memory through an interconnection network.]
Flynn’s Classification of Computer Architectures
Multiple Instruction stream, Multiple Data stream (MIMD)
[Figure: MIMD with distributed memory — CPUs CPU_1 … CPU_n, each consisting of a control unit, a processing unit, and a local memory LM_i, communicate through an interconnection network.]
Performance of Parallel Architectures
Important questions:
Peak rate: the maximal computation rate that can be theoretically achieved
when all modules are fully utilized.
The peak rate is of no practical significance for the user; it is mostly used by
vendor companies for marketing their computers.
Speedup:

    S = TS / TP

TS: execution time on a single processor; TP: execution time on the parallel system.

Efficiency:

    E = S / p

S: speedup;
p: number of processors.

If f is the fraction of the computation that is strictly sequential, the parallel execution time is

    TP = f × TS + (1 − f) × TS / p

and the speedup becomes

    S = TS / (f × TS + (1 − f) × TS / p) = 1 / (f + (1 − f) / p)

[Figure: S plotted against f (0 < f ≤ 1) for a fixed number of processors; the speedup falls steeply as f grows and is bounded by 1/f.]
Amdahl’s Law
Amdahl's law: even a small fraction of sequential computation imposes a
limit on the achievable speedup; a speedup higher than 1/f cannot be achieved,
regardless of the number of processors.
    E = S / p = 1 / (f × (p − 1) + 1)
Besides the intrinsic sequentiality of some parts of an algorithm, there are also
other factors that limit the achievable speedup:
communication cost
load balancing of processors
costs of creating and scheduling processes
I/O operations
There are many algorithms with a high degree of parallelism; for such
algorithms the value of f is very small and can be ignored. These algorithms
are suited for massively parallel systems; in such cases the other limiting
factors, such as the cost of communication, become critical.
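The speedup and efficiency formulas can be checked numerically; a minimal sketch in plain Python (the sequential fraction f = 0.05 is an arbitrary illustrative value):

```python
def speedup(f, p):
    """Amdahl's law: S = 1 / (f + (1 - f) / p)."""
    return 1.0 / (f + (1.0 - f) / p)

def efficiency(f, p):
    """E = S / p = 1 / (f * (p - 1) + 1)."""
    return speedup(f, p) / p

f = 0.05  # assumed sequential fraction (illustrative only)
for p in (10, 100, 1000):
    print(p, round(speedup(f, p), 2), round(efficiency(f, p), 3))

# No processor count can push S beyond the 1/f bound:
assert all(speedup(f, p) < 1 / f for p in (10, 100, 1000))
```

For f = 0.05 the speedup saturates below 20 even with 1000 processors, which is exactly the 1/f limit stated above.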
The Interconnection Network
The traffic in the IN consists of data transfers and transfers of commands and
requests.
Single Bus
[Figure: nodes Node_1 … Node_n attached to a single shared bus, and a completely connected network of five nodes.]
Mesh networks are cheaper than completely connected ones and provide
relatively good performance.
To transmit information between certain nodes, routing through
intermediate nodes is needed (at most 2 × (n − 1) intermediate nodes for an n × n mesh).
It is possible to provide wraparound connections: between nodes 1 and 13, 2
and 14, etc.
Three-dimensional meshes have also been implemented.
The Interconnection Network
Hypercube network
[Figure: a four-dimensional hypercube of 16 nodes N0 … N15; each node is directly connected to the four nodes whose binary addresses differ from its own in exactly one bit.]
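Routing in a hypercube is simple: fix one differing address bit per hop, so the number of hops equals the Hamming distance between the two node numbers (at most four in a 16-node cube). A minimal sketch in plain Python (the function name is ours):

```python
def hypercube_route(src, dst):
    """Return the node sequence from src to dst, flipping one
    differing address bit per hop (dimension-order routing)."""
    path = [src]
    node = src
    diff = src ^ dst          # 1-bits mark the dimensions to traverse
    bit = 0
    while diff:
        if diff & 1:
            node ^= 1 << bit  # correct this address bit
            path.append(node)
        diff >>= 1
        bit += 1
    return path

print(hypercube_route(0, 13))  # [0, 1, 5, 13]
```

The route N0 → N1 → N5 → N13 takes three hops, matching the three 1-bits in 0 XOR 13.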
[Figure: SIMD array processor — a control unit driving a grid of processing units (PUs).]
SIMD computers are usually called array processors.
PUs are very simple: an ALU which executes the instruction broadcast by
the CU, a few registers, and some local memory.
The first SIMD computer: ILLIAC IV (1970s), 64 relatively powerful
processors (mesh connection, see above).
A newer SIMD computer: CM-2 (Connection Machine, by Thinking Machines
Corporation), 65,536 very simple processors (connected as a hypercube).
Array processors are specialized for numerical problems formulated as
matrix or vector calculations. Each PU computes one element of the result.
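The computing model can be mimicked in software; a toy sketch in plain Python, where each list element stands for the value held by one PU:

```python
# One broadcast instruction ("add") conceptually executed by all PUs
# at once; PU_i holds a[i] and b[i] in its local memory and produces
# one element of the result.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
result = [x + y for x, y in zip(a, b)]  # one logical SIMD step
print(result)  # [11, 22, 33, 44]
```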
MIMD computers
MIMD with shared memory
[Figure: processors Processor_1 … Processor_n, each with its own local memory, connected through an interconnection network to a shared memory.]
Examples:
- classical parallel mainframe computers (1970s–1990s): IBM 370/390 Series; CRAY X-MP, CRAY Y-MP, CRAY 3
- modern multicore chips: Intel Core Duo, i5, i7; ARM MPCore
Examples:
Intel x86 Multicore architectures
- Intel Core Duo
- Intel Core i7
ARM11 MPCore
Intel Core Duo
Composed of two Intel Core superscalar processors
[Figure: two processor cores on one chip, connected to off-chip main memory.]
Intel Core i7
Contains four Nehalem processors.
[Figure: four Nehalem cores on one chip, connected to off-chip main memory.]
ARM11 MPCore
[Figure: several ARM11 cores on one chip, connected to off-chip main memory.]
Multithreading
A running program:
one or several processes; each process:
- one or several threads
[Figure: process 1 with threads 1_1, 1_2, 1_3; process 2 with threads 2_1, 2_2; process 3 with thread 3_1; the threads are distributed over processors 1 and 2.]
Blocked multithreading:
Instructions of the same thread are executed until the thread is blocked;
blocking of a thread triggers a switch to another thread ready to execute.
[Figure: with no multithreading the processor stays idle while thread A is blocked; with blocked multithreading, threads B, C, and D execute while A waits.]
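The switching policy can be imitated with a toy scheduler (plain Python; 'c' marks a compute cycle, 'b' a blocking event such as a cache miss, and the thread switch is assumed to be free):

```python
# Toy simulator of blocked multithreading: each thread is a list of
# operations. The processor runs one thread until it blocks, then
# switches to the next thread in round-robin order.
def blocked_mt(threads):
    trace = []
    names = list(threads)
    i = 0
    while any(threads.values()):
        ops = threads[names[i % len(names)]]
        while ops:                      # run this thread...
            op = ops.pop(0)
            trace.append(names[i % len(names)])
            if op == 'b':               # ...until it blocks
                break
        i += 1
    return trace

trace = blocked_mt({'A': ['c', 'c', 'b', 'c'],
                    'B': ['c', 'b', 'c']})
print(''.join(trace))  # AAABBAB
```

For threads A = c,c,b,c and B = c,b,c the trace is AAABBAB: A runs until it blocks, B takes over until it blocks, then each finishes its remaining cycle.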
Approaches to Multithreaded Superscalar Processors
[Figure: issue-slot usage for threads X, Y, Z, V — with no multithreading many slots of the superscalar processor stay empty; interleaved multithreading switches to a different thread every cycle; blocked multithreading switches on a blocking event; simultaneous multithreading fills the slots of each cycle with instructions from several threads.]
Multithreaded Processors
Are multithreaded processors parallel computers?
Yes:
they execute parallel threads;
certain sections of the processor are available in several copies (e.g.
program counter, instruction registers + other registers);
the processor appears to the operating system as several processors.
No:
only certain sections of the processor are available in several copies
but we do not have several processors; the execution resources (e.g.
functional units) are common.
NVIDIA, AMD, etc. have introduced high performance GPUs that can be used
for general purpose high performance computing: general purpose graphic
processing units (GPGPUs).
[Figure: GPGPU organization — streaming multiprocessors (SMs), each containing streaming processors (SPs), special function units (SFUs), and a shared memory; each TPC consists of two SMs, controlled by the SM controller (SMC).]
Common CPU-based parallel computers are primarily optimised for latency:
each thread runs as fast as possible, but only a limited number of
threads is active.
Vector computers usually have vector registers, each of which can store 64 up
to 128 words.
Vector instructions:
load vector from memory into vector register
store vector into memory
arithmetic and logic operations between vectors
operations between vectors and scalars
etc.
[Figure: vector processor organization — the instruction decoder dispatches scalar instructions to the scalar unit (scalar registers, scalar functional units) and vector instructions to the vector unit (vector registers, vector functional units); both units access memory.]
Vector registers:
- n general purpose vector registers Ri, 0 ≤ i ≤ n−1, each of length s;
- vector length register VL: stores the length l (0 ≤ l ≤ s) of the currently processed vectors;
- mask register M: stores a set of l bits, interpreted as boolean values; vector instructions can be executed in masked mode: vector register elements corresponding to a false value in M are ignored.
Vector Instructions
Masked vector addition R0 ← R0 + R1 (VL = 10; register elements beyond VL are ignored):

R0 before   R1    M   R0 after
    8        1    1       9
    5       -3    1       2
    0        2    0       0
    2        5    1       7
    3       11    0       3
   -4        7    0      -4
    1        3    1       4
   12        9    1      21
    7        0    0       7
    9        4    1      13
Vector Instructions
Example: increment every positive element of the 50-element array T:

R0 ← T(0:49:1)
VL ← 50
M ← R0 > 0
WHERE(M) R0 ← R0 + 1
T(0:49:1) ← R0
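The same sequence, written in plain Python (T is an assumed 50-element array; the list comprehension plays the role of the masked vector add):

```python
T = list(range(-25, 25))        # some 50-element array in memory

R0 = T[0:50]                    # R0 <- T(0:49:1), VL = 50
M = [x > 0 for x in R0]         # M <- R0 > 0
R0 = [x + 1 if m else x         # WHERE(M) R0 <- R0 + 1
      for x, m in zip(R0, M)]
T[0:50] = R0                    # T(0:49:1) <- R0
```

Only the elements with a true mask bit are changed; negative elements and zero pass through untouched.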
Multimedia Extensions to General Purpose
Microprocessors
Video and audio applications very often deal with large arrays of small data
types (8 or 16 bits).
The Pentium family provides 57 MMX instructions. They treat data in a SIMD
fashion.
Multimedia Extensions to General Purpose
Microprocessors
Use the entire width of the data path (32 or 64 bits) when processing
small data types (8, 12, or 16 bits) used in signal processing.
With a word size of 64 bits, the adders can be used to implement eight 8-bit
additions in parallel.
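The hardware trick is to cut the carry chain at every 8-bit lane boundary. The same effect can be illustrated in software (a SWAR sketch in Python on one 64-bit word; MMX does this in the adder hardware):

```python
H = 0x8080808080808080  # the top bit of each of the eight 8-bit lanes

def add_bytes(a, b):
    """Eight independent 8-bit additions (mod 256) inside one
    64-bit word: add the low 7 bits of each lane, then restore
    the top bits with XOR so no carry crosses a lane boundary."""
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H)

# Lane-wise: 0xFF + 0x01 wraps to 0x00, 0x01 + 0x02 = 0x03.
print(hex(add_bytes(0x00000000000001FF, 0x0000000000000201)))
```

Each lane wraps modulo 256 independently: adding 0x01 to a lane holding 0xFF yields 0x00 without disturbing its neighbour.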