CA Chap7: Multicores, Multiprocessors
7.1. Introduction
❑ Multicore microprocessors
l Chips with multiple processors (cores)
Types of Parallelism
[Figure: timelines contrasting pipelining and data-level parallelism (DLP), in hardware and in software]
What We’ve Already Covered
❑ §2.11: Parallelism and Instructions
l Synchronization
Parallel Programming
❑ Parallel software is the problem
❑ Need to get significant performance improvement
l Otherwise, just use a faster uniprocessor, since it’s easier!
❑ Difficulties
l Partitioning
l Coordination
l Communications overhead
Amdahl’s Law
❑ Sequential part can limit speedup
❑ Example: 100 processors, 90× speedup?
l Told = Tparallelizable + Tsequential
l Tnew = Tparallelizable/100 + Tsequential
l Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
l Solving: Fparallelizable = 0.999
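A tiny C sketch of this arithmetic (the helper names amdahl_speedup and required_fraction are mine, not from the slides): it confirms that reaching a 90× speedup on 100 processors requires roughly 99.9% of the work to be parallelizable.

/* Amdahl's Law: speedup = 1 / ((1 - F) + F/P), with parallelizable
   fraction F and processor count P. */
#include <stdio.h>

static double amdahl_speedup(double F, int P) {
    return 1.0 / ((1.0 - F) + F / P);
}

/* Solve the formula above for F, given a target speedup S on P processors. */
static double required_fraction(double S, int P) {
    return (1.0 - 1.0 / S) / (1.0 - 1.0 / (double)P);
}

int main(void) {
    printf("F needed for 90x on 100 processors: %.4f\n", required_fraction(90.0, 100)); /* ~0.999 */
    printf("Speedup with F = 0.999 on 100 processors: %.1f\n", amdahl_speedup(0.999, 100));
    return 0;
}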
Scaling Example
❑ Workload: sum of 10 scalars, plus the sum of a 10 × 10 matrix
l Speedup from 10 to 100 processors
❑ Single processor: Time = (10 + 100) × tadd
❑ 10 processors
l Time = 10 × tadd + 100/10 × tadd = 20 × tadd
l Speedup = 110/20 = 5.5 (5.5/10 = 55% of potential)
❑ 100 processors
l Time = 10 × tadd + 100/100 × tadd = 11 × tadd
l Speedup = 110/11 = 10 (10/100 = 10% of potential)
❑ Assumes the load can be balanced across processors
Scaling Example (cont)
❑ What if the matrix grows to 100 × 100?
❑ Single processor: Time = (10 + 10000) × tadd
❑ 10 processors
l Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
l Speedup = 10010/1010 = 9.9 (99% of potential)
❑ 100 processors
l Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
l Speedup = 10010/110 = 91 (91% of potential)
❑ Assumes the load can still be balanced across processors
Strong vs Weak Scaling
❑ Strong scaling: problem size fixed, as in the example above
❑ Weak scaling: problem size grows in proportion to the number of processors
l 10 processors, 10 × 10 matrix
- Time = 10 × tadd + 100/10 × tadd = 20 × tadd
l 100 processors, 32 × 32 matrix
- Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
l Constant performance in this example
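The scaling and strong/weak comparison above can be reproduced with a small sketch; the time_units helper and the choice of tadd = 1 time unit are assumptions made here for illustration.

/* Time model from the scaling example: 10 serial scalar adds plus an
   n x n matrix sum divided across p processors, measured in t_add units. */
#include <stdio.h>

static double time_units(int n, int p) {
    return 10.0 + (double)(n * n) / p;
}

int main(void) {
    int sizes[] = {10, 100};          /* matrix dimension n */
    int procs[] = {1, 10, 100};       /* processor count p */
    for (int s = 0; s < 2; s++) {
        int n = sizes[s];
        double t1 = time_units(n, 1); /* single-processor time */
        for (int k = 0; k < 3; k++) {
            int p = procs[k];
            printf("n = %3d  p = %3d  time = %7.0f  speedup = %5.1f\n",
                   n, p, time_units(n, p), t1 / time_units(n, p));
        }
    }
    return 0;
}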
7.2. Shared Memory Multiprocessor
Shared Memory Arch: UMA
[Figure: UMA organization; processors access a shared main memory (M) over a single bus, so bus contention limits scaling]
Shared Memory Arch: NUMA
[Figure: NUMA organization; each processor (P) has its own cache (C) and local memory (M), all connected by a shared bus]
Example: Sun Fire V210 / V240 Mainboard
Example: Dell PowerEdge R720
Example: Sum Reduction
/* 100 processors on a shared-memory (UMA) machine; processor Pn (0..99)
   first sums its own block of 1,000 elements into a private partial sum */
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];
[Figure: tree reduction of partial sums; left: 4 processors (0..3); right: 7 processors (0..6), the case where half is odd]
/* Tree reduction: repeatedly add pairs of partial sums */
half = 100; /* number of live partial sums */
repeat
    synch(); /* wait until all sums at this level are ready */
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
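One possible pthreads rendering of this shared-memory reduction is sketched below. The 4 threads standing in for processors, the all-ones input array, and the use of a pthread barrier for synch() are assumptions made here; compile with -pthread.

/* Shared-memory sum reduction with a tree of pairwise adds. */
#include <pthread.h>
#include <stdio.h>

#define P 4            /* "processors" (threads) */
#define N 4000         /* elements, N/P per thread */

static double A[N];
static double sum[P];  /* one partial sum per thread */
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    int Pn = (int)(long)arg;

    /* Phase 1: each thread sums its own block */
    sum[Pn] = 0;
    for (int i = (N / P) * Pn; i < (N / P) * (Pn + 1); i++)
        sum[Pn] += A[i];

    /* Phase 2: tree reduction, halving the live partial sums each round */
    int half = P;
    do {
        pthread_barrier_wait(&barrier);      /* plays the role of synch() */
        if (half % 2 != 0 && Pn == 0)
            sum[0] += sum[half - 1];         /* odd case: thread 0 picks up the extra sum */
        half = half / 2;
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half > 1);
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (int i = 0; i < N; i++) A[i] = 1.0;  /* expected total: 4000 */

    pthread_barrier_init(&barrier, NULL, P);
    for (long i = 0; i < P; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);

    printf("total = %.0f\n", sum[0]);        /* final result ends up in sum[0] */
    return 0;
}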
Message Passing
❑ Each processor has its own private physical address space
❑ Processors cooperate by explicitly sending and receiving messages over an interconnect
Loosely Coupled Clusters
❑ Parallel sum on a cluster: each processor is sent 1,000 numbers and computes a local partial sum
sum = 0;
for (i = 0; i < 1000; i = i + 1)
    sum = sum + AN[i];
❑ Reduction
l Half the processors send their sums; the other half receive and add
l Then a quarter send and a quarter receive and add, …
Sum Reduction (Again)
[Figure: message-passing reduction tree for seven processors (0..6), starting with half = 7]
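On a message-passing cluster the same halving tree can be written with explicit MPI sends and receives. The sketch below is illustrative only: each rank fakes its 1,000-element partial sum with a single value, and the sender/receiver split follows the scheme pictured above. Run with mpirun -np <P>.

/* Message-passing sum reduction: upper half sends, lower half receives and adds. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int Pn, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    double sum = (double)(Pn + 1);    /* stand-in for this rank's local partial sum */

    int limit = P, half = P;
    do {
        half = (half + 1) / 2;        /* dividing line between receivers and senders */
        if (Pn >= half && Pn < limit) /* upper ranks send their sums down */
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit - half) {      /* matching lower ranks receive and add */
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, Pn + half, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += other;
        }
        limit = half;                 /* senders drop out of the next round */
    } while (half > 1);

    if (Pn == 0) printf("total = %.0f\n", sum);  /* final sum ends up on rank 0 */
    MPI_Finalize();
    return 0;
}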
Message Passing Systems
❑ Message passing is a form of communication between processes or objects
❑ Message-passing systems have been called "shared nothing" systems: each participant is a black box
❑ Examples: ONC RPC, CORBA, Java RMI, DCOM, SOAP, .NET Remoting, CTOS, QNX Neutrino RTOS, OpenBinder, D-Bus, Unison RTOS
❑ Open-source cluster examples: Beowulf, Microwulf
❑ An alternate classification, by instruction and data streams:
l SISD (single instruction, single data): e.g., Intel Pentium 4
l SIMD (single instruction, multiple data): e.g., SSE instructions of x86
l MISD (multiple instruction, single data): no examples today
l MIMD (multiple instruction, multiple data): e.g., Intel Xeon e5345
Single Instruction, Single Data
◼ Single Instruction: only one instruction stream is acted on during any one clock cycle
◼ Single Data: only one data stream is used as input during any one clock cycle
❑ The classical serial (non-parallel) computer, e.g., a conventional uniprocessor
Multiple Instruction, Multiple Data
◼ Multiple Instruction: every processor may be executing a different instruction stream
◼ Multiple Data: every processor may be working with a different data stream
❑ Execution can be synchronous or asynchronous, deterministic or non-deterministic
❑ Currently the most common type of parallel computer; most modern supercomputers fall into this category
❑ Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs
❑ Note: many MIMD architectures also include SIMD execution sub-components
Multiple Instruction, Multiple Data
[Photos: example MIMD machines: IBM POWER5, HP/Compaq AlphaServer, Intel IA-32]
Single Instruction, Multiple Data
❑ Operate elementwise on vectors of data
l E.g., MMX and SSE instructions in x86
- Multiple data elements in 128-bit wide registers
❑ Reduced instruction control hardware
❑ Works best for highly data-parallel applications with a high degree of regularity, such as graphics/image processing
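As a concrete taste of x86 SIMD, the sketch below uses SSE intrinsics to add two float arrays four elements at a time; the array names and the length of 8 are made up for illustration.

/* SIMD with SSE: one packed add works on four floats held in a 128-bit register. */
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

#define N 8              /* kept a multiple of 4 for simplicity */

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 10.0f * i; }

    for (int i = 0; i < N; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);  /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);   /* 4 additions in one instruction */
        _mm_storeu_ps(&c[i], vc);
    }

    for (int i = 0; i < N; i++) printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}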
Single Instruction, Multiple Data
[Photos: example SIMD machines: ILLIAC IV, MasPar, Thinking Machines CM-2, and a modern GPU]
Vector Processors
❑ Highly pipelined function units
❑ Stream data from vector registers through the units: data is gathered from memory into vector registers, and results are stored from registers back to memory
❑ Significantly reduces instruction-fetch and decode bandwidth
Vector vs. Scalar
❑ Vector architectures and compilers
l Simplify data-parallel programming
l Explicit statement of absence of loop-carried dependences
- Reduced checking in hardware
l Regular access patterns benefit from interleaved and burst memory
l Avoid control hazards by avoiding loops
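One way to state the absence of loop-carried dependences explicitly in C is OpenMP's simd directive, sketched below on a DAXPY-style loop. The function and array names are illustrative; build with an OpenMP-capable compiler (e.g. -fopenmp), and note the pragma is simply ignored otherwise.

/* y = a*x + y. The pragma asserts the iterations are independent,
   so the compiler is free to use vector/SIMD instructions. */
#include <stdio.h>

#define N 1000

static void daxpy(int n, double a, const double *x, double *y) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
    daxpy(N, 3.0, x, y);
    printf("y[0] = %.1f\n", y[0]);  /* expect 5.0 */
    return 0;
}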
7.3. Introduction to GPUs
❑ 3D graphics processing
l Originally done on high-end computers (e.g., SGI)
l Moore's Law → lower cost, higher density
l 3D graphics cards for PCs and game consoles
Graphics in the System
GPU Architectures
❑ Processing is highly data-parallel
l GPUs are highly multithreaded
l Use thread switching to hide memory latency
- Less reliance on multi-level caches
l Graphics memory is wide and high-bandwidth
❑ Trend toward general purpose GPUs
l Heterogeneous CPU/GPU systems
l CPU for sequential code, GPU for parallel code
❑ Programming languages/APIs
l DirectX, OpenGL
l C for Graphics (Cg), High Level Shader Language (HLSL)
l Compute Unified Device Architecture (CUDA)
(Heterogeneous: a system combining different kinds of processors, here CPU and GPU)
Example: NVIDIA Tesla
[Figure: NVIDIA Tesla GPU architecture; each streaming multiprocessor contains 8 streaming processors]
Example: NVIDIA Tesla
❑ Streaming Processors
l Single-precision FP and integer units
l Each SP is fine-grained multithreaded
Interconnection Networks
❑ Network topologies
l Arrangements of processors, switches, and links
[Figure: common network topologies: bus, ring, 2D mesh, N-cube (N = 3), fully connected]
Network Characteristics
❑ Performance
l Latency per message (unloaded network)
l Throughput
- Link bandwidth
- Total network bandwidth
- Bisection bandwidth
l Congestion delays (depending on traffic)
❑ Cost
❑ Power
❑ Routability in silicon
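As a small illustration of two of these throughput metrics, the sketch below counts total and bisection bandwidth for a ring and a fully connected network using the usual link-counting conventions; the node count P and per-link bandwidth B are arbitrary values chosen here.

/* Total network bandwidth: bandwidth of every link added together.
   Bisection bandwidth: bandwidth across the worst-case cut that splits
   the machine into two equal halves. */
#include <stdio.h>

int main(void) {
    double B = 1.0;   /* bandwidth of one link (arbitrary units) */
    int P = 64;       /* number of nodes */

    /* Ring: P links total; cutting it into two halves severs 2 links. */
    printf("ring:            total = %7.0f  bisection = %7.0f\n", P * B, 2 * B);

    /* Fully connected: P*(P-1)/2 links; (P/2)*(P/2) of them cross the bisection. */
    printf("fully connected: total = %7.0f  bisection = %7.0f\n",
           P * (P - 1) / 2.0 * B, (P / 2.0) * (P / 2.0) * B);
    return 0;
}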
Concluding Remarks
❑ Goal: higher performance by using multiple processors (multicores, multiprocessors, clusters, GPUs)
❑ The main difficulty is software: work must be partitioned, coordinated, and balanced to get real speedup (Amdahl's Law)
❑ Shared-memory and message-passing machines, SIMD/vector units, and GPUs are the main ways hardware exploits parallelism today