
Chapter 7:

Multicores, Multiprocessors

[with materials from Computer Organization and Design, 4th Edition,


Patterson & Hennessy, © 2008, MK]

1
7.1. Introduction

❑ Goal: connecting multiple computers


to get higher performance
l Multiprocessors
l Scalability, availability, power efficiency

❑ Job-level (process-level) parallelism
l High throughput for independent jobs

❑ Parallel processing program


l Single program run on multiple processors

❑ Multicore microprocessors
l Chips with multiple processors (cores)
2
Types of Parallelism
(Figure: execution timelines contrasting Pipelining, Data-Level Parallelism (DLP), Instruction-Level Parallelism (ILP), and Thread-Level Parallelism (TLP))
3
Hardware and Software

❑ Hardware
l Serial: e.g., Pentium 4
l Parallel: e.g., quad-core Xeon e5345

❑ Software
l Sequential: e.g., matrix multiplication
l Concurrent: e.g., operating system

❑ Sequential/concurrent software can run on


serial/parallel hardware
l Challenge: making effective use of parallel hardware

4
What We’ve Already Covered
❑ §2.11: Parallelism and Instructions
l Synchronization

❑ §3.6: Parallelism and Computer Arithmetic


l Associativity

❑ §4.10: Parallelism and Advanced Instruction-Level


Parallelism
❑ §5.8: Parallelism and Memory Hierarchies
l Cache Coherence

❑ §6.9: Parallelism and I/O:


l Redundant Arrays of Inexpensive Disks

5
Parallel Programming
❑ Parallel software is the problem
❑ Need to get significant performance improvement
l Otherwise, just use a faster uniprocessor, since it’s easier!

❑ Difficulties
l Partitioning
l Coordination
l Communications overhead

6
Amdahl’s Law
❑ Sequential part can limit speedup
❑ Example: 100 processors, 90× speedup?
l Told = Tparallelizable + Tsequential
l Tnew = Tparallelizable/100 + Tsequential
l Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
l Solving: Fparallelizable = 0.999

❑ Need sequential part to be 0.1% of original time

7
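A minimal C sketch of this calculation (the helper name amdahl_speedup is ours, not from the slides): it evaluates the speedup formula above and shows that Fparallelizable = 0.999 on 100 processors gives roughly the 90× target.

#include <stdio.h>

/* Amdahl's Law: speedup on n processors when a fraction f of the
   original execution time is parallelizable. */
static double amdahl_speedup(double f, int n)
{
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void)
{
    /* Slide example: 100 processors, parallel fraction 0.999 */
    printf("Speedup = %.1f\n", amdahl_speedup(0.999, 100)); /* prints about 91 */
    return 0;
}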
Scaling Example
❑ Workload: sum of 10 scalars, and 10 × 10 matrix
sum
l Speed up from 10 to 100 processors
❑ Single processor: Time = (10 + 100) × tadd
❑ 10 processors
l Time = 10 × tadd + 100/10 × tadd = 20 × tadd
l Speedup = 110/20 = 5.5 (5.5/10 = 55% of potential)
❑ 100 processors
l Time = 10 × tadd + 100/100 × tadd = 11 × tadd
l Speedup = 110/11 = 10 (10/100 = 10% of potential)
❑ Assumes load can be balanced across processors
8
Scaling Example (cont)

❑ What if matrix size is 100 × 100?


❑ Single processor: Time = (10 + 10000) × tadd
❑ 10 processors
l Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
l Speedup = 10010/1010 = 9.9 (99% of potential)
❑ 100 processors
l Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
l Speedup = 10010/110 = 91 (91% of potential)

❑ Assuming load balanced

9
Strong vs Weak Scaling

❑ Strong scaling: problem size fixed


l As in example

❑ Weak scaling: problem size proportional to number of processors
l 10 processors, 10 × 10 matrix
- Time = 20 × tadd
l 100 processors, 32 × 32 matrix
- Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
l Constant performance in this example

10
7.2. Shared Memory Multiprocessor

❑ SMP: shared memory multiprocessor


l Hardware provides single physical
address space for all processors
l Synchronize shared variables using locks
l Usually adopted in general-purpose CPUs in laptops and desktops

❑ Memory access time: UMA vs NUMA

11
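As a hedged illustration of synchronizing a shared variable with a lock on an SMP (the slides do not prescribe an API; POSIX threads is just one common choice), four threads updating one counter in the shared address space might look like this:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static long shared_sum = 0;                 /* lives in the single shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long local = (long)(intptr_t)arg;       /* each thread's private contribution */
    pthread_mutex_lock(&lock);              /* lock serializes updates to the shared variable */
    shared_sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)(i + 1));
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("shared_sum = %ld\n", shared_sum); /* 1+2+3+4 = 10 */
    return 0;
}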
Shared Memory Arch: UMA

❑ Access time to a memory location is independent of which processor makes the request, or which memory chip contains the transferred data
❑ Used for a few processors

(Diagram: Intel's FSB-based UMA architecture: processors (P) with caches (C) share memory (M) over a single front-side bus, which becomes a source of bus contention)

12
Shared Memory Arch: NUMA

❑ Access time depends on the memory location relative to a processor
❑ Used for dozens or hundreds of processors
❑ Processors use the same memory address space (Distributed Shared Memory, DSM)
❑ Intel QPI competes with AMD HyperTransport; both are point-to-point links rather than buses
(Diagram: QuickPath Interconnect architecture: each processor (P) with its cache (C) has an embedded memory controller directly attached to local memory (M), and processors are linked point-to-point rather than over a shared bus)

13
Shared Memory Arch: NUMA

❑ E.g., the memory managers of programming languages also need to be NUMA-aware; Java is NUMA-aware
❑ E.g., Oracle 11g is explicitly enabled for NUMA support
❑ E.g., Windows XP SP2, Server 2003, and Vista support NUMA

(Diagram: two NUMA nodes, each an SMP with processors (P), caches (C), and local memory (M) on its own bus, connected to form one shared address space)

14
Example: Sun Fire V210 / V240 Mainboard

15
Example: Dell PowerEdge R720

16
Example: Sum Reduction

❑ Sum 100,000 numbers on 100 processor UMA


l Each processor has ID: 0 ≤ Pn ≤ 99
l Partition 1000 numbers per processor
l Initial summation on each processor
100,000 numbers

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i++)
    sum[Pn] = sum[Pn] + A[i];

❑ Now need to add these partial sums


l Reduction: divide and conquer
l Half the processors add pairs, then quarter, …
l Need to synchronize between reduction steps
17
Example: Sum Reduction

(Diagram: tree reduction with 4 processors (0–3), and with 7 processors (0–6) where half is odd)

half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1); /* exit with final sum in sum[0] */
18
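Putting the two fragments together, one plausible C rendering of the whole UMA reduction (a sketch only: NPROCS, N, the shared arrays, and the pthread barrier standing in for synch() are assumptions about the surrounding setup):

#include <pthread.h>

#define NPROCS 100
#define N      100000

double A[N];                /* shared input data */
double sum[NPROCS];         /* shared partial sums */
pthread_barrier_t barrier;  /* assumed initialized elsewhere for NPROCS threads */

/* Run by every processor/thread; Pn is this thread's ID, 0..NPROCS-1. */
void parallel_sum(int Pn)
{
    /* Phase 1: each processor sums its own 1000-element slice. */
    sum[Pn] = 0.0;
    for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
        sum[Pn] = sum[Pn] + A[i];

    /* Phase 2: tree reduction, halving the active processors each step. */
    int half = NPROCS;
    do {
        pthread_barrier_wait(&barrier);      /* the slides' synch() */
        if (half % 2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half - 1]; /* odd case: P0 picks up the extra element */
        half = half / 2;                     /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn + half];
    } while (half > 1);
    /* On exit, sum[0] holds the total. */
}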
Message Passing

❑ Each processor has private physical address space


❑ Hardware sends/receives messages between processors

19
Loosely Coupled Clusters

❑ Network of independent computers


l Each has private memory and OS
l Connected using I/O system
- E.g., Ethernet/switch, Internet

❑ Suitable for applications with independent tasks


l Web servers, databases, simulations, …

❑ High availability, scalable, affordable


❑ Problems
l Administration cost (prefer virtual machines)
l Low interconnect bandwidth
- c.f. processor/memory bandwidth on an SMP
20
Sum Reduction (Again)
❑ Sum 100,000 numbers on 100 processors
❑ First distribute 1000 numbers to each
l Then do partial sums

1,000 numbers

sum = 0;
for (i = 0; i < 1000; i++)
    sum = sum + AN[i];

❑ Reduction
l Half the processors send, the other half receive and add
l Then a quarter send, a quarter receive and add, …

21
Sum Reduction (Again)

❑ Given send() and receive() operations


limit = 100; half = 100; /* 100 processors */
repeat
    half = (half+1)/2;    /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit/2))
        sum = sum + receive();
    limit = half;         /* upper limit of senders */
until (half == 1);        /* exit with final sum */

(Diagram: reduction among 7 processors, 0–6, starting from half = 7)

l Send/receive also provide synchronization


l Assumes send/receive take similar time to addition

22
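On a real cluster this send/receive pattern is usually expressed through a message-passing library; a hedged MPI sketch of the same reduction (assuming each rank fills its own 1000 local values, here set to 1.0 as placeholder data) might be:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int Pn, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank sums its own private numbers. */
    double AN[1000], local_sum = 0.0;
    for (int i = 0; i < 1000; i++)
        AN[i] = 1.0;                        /* placeholder local data */
    for (int i = 0; i < 1000; i++)
        local_sum = local_sum + AN[i];

    /* MPI_Reduce performs the send/receive tree reduction internally. */
    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (Pn == 0)
        printf("total = %f on %d processes\n", total, nprocs);

    MPI_Finalize();
    return 0;
}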
Message Passing Systems
❑ Message passing is a type of communication between processes or objects in computer science
❑ Examples: ONC RPC, CORBA, Java RMI, DCOM, SOAP, .NET Remoting, CTOS, QNX Neutrino RTOS, OpenBinder, D-Bus, Unison RTOS
❑ Message passing systems have been called "shared nothing" systems (each participant is a black box)
❑ Open source clusters: Beowulf, Microwulf

(Photos: Microwulf and Beowulf message-passing clusters)


23
§7.6 SISD, MIMD, SIMD, SPMD, and Vector
Instruction and Data Streams

❑ An alternate classification

                            Data Streams
                            Single                    Multiple
Instruction    Single       SISD:                     SIMD:
Streams                     Intel Pentium 4           SSE instructions of x86
               Multiple     MISD:                     MIMD:
                            No examples today         Intel Xeon e5345

◼ SPMD: Single Program Multiple Data


◼ A parallel program on a MIMD computer
◼ Conditional code for different processors

24
Single Instruction, Single Data

❑ Single Instruction: Only one


instruction stream is being acted on
by the CPU during any one clock
cycle
❑ Single Data: Only one data stream is being used as input during any one clock cycle
❑ Deterministic execution
❑ Examples: older generation
mainframes, minicomputers and
workstations; most modern day PCs.
25
Single Instruction, Single Data

(Photos: UNIVAC 1, IBM 360, CRAY 1, CDC 7600, PDP 1, laptop)

26
Multi Instruction, Multi Data
◼ Multiple Instruction: Every
processor may be executing a
different instruction stream
◼ Multiple Data: Every processor
may be working with a different
data stream
❑ Execution can be synchronous or asynchronous, deterministic or
non-deterministic
❑ Currently, the most common type of parallel computer - most
modern supercomputers fall into this category.
❑ Examples: most current supercomputers, networked parallel
computer clusters and "grids", multi-processor SMP computers,
multi-core PCs.
❑ Note: many MIMD architectures also include SIMD execution sub-
components
27
Multi Instruction, Multi Data

(Photos: IBM POWER5, Intel IA32, HP/Compaq AlphaServer, AMD Opteron, Cray XT3, IBM BG/L)

28
Single Instruction, Multiple Data
❑ Operate elementwise on vectors of data
l E.g., MMX and SSE instructions in x86
- Multiple data elements in 128-bit wide registers

❑ All processors execute the same instruction at the same time
l Each with different data address, etc.

❑ Reduced instruction control hardware
❑ Works best for highly data-parallel applications with a high degree of regularity, such as graphics/image processing
29
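As a small hedged illustration of the SSE idea above (four single-precision elements per 128-bit register; the function is ours, not from the slides):

#include <xmmintrin.h>  /* SSE intrinsics */

/* Elementwise c[i] = a[i] + b[i], four lanes per instruction (n assumed a multiple of 4). */
void add_arrays_sse(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats into a 128-bit register */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* one SIMD add covers all 4 lanes */
        _mm_storeu_ps(&c[i], vc);
    }
}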
Single Instruction, Multiple Data

❑ Synchronous (lockstep) and deterministic execution


❑ Two varieties: Processor Arrays and Vector Pipelines
❑ Most modern computers, particularly those with
graphics processor units (GPUs) employ SIMD
instructions and execution units.

30
Single Instruction, Multiple Data

(Photos: ILLIAC IV, MasPar, CRAY X-MP, Cray Y-MP, Thinking Machines CM-2, GPU)

31
Vector Processors

❑ Highly pipelined function units


❑ Stream data from/to vector registers to units
l Data collected from memory into registers
l Results stored from registers to memory

❑ Example: Vector extension to MIPS


l 32 × 64-element registers (64-bit elements)
l Vector instructions
- lv, sv: load/store vector
- addv.d: add vectors of double
- addvs.d: add scalar to each element of vector of double

❑ Significantly reduces instruction-fetch bandwidth


32
Example: DAXPY (Y = a × X + Y)
❑ Conventional MIPS code
l.d $f0,a($sp) ;load scalar a
addiu r4,$s0,#512 ;upper bound of what to load
loop: l.d $f2,0($s0) ;load x(i)
mul.d $f2,$f2,$f0 ;a × x(i)
l.d $f4,0($s1) ;load y(i)
add.d $f4,$f4,$f2 ;a × x(i) + y(i)
s.d $f4,0($s1) ;store into y(i)
addiu $s0,$s0,#8 ;increment index to x
addiu $s1,$s1,#8 ;increment index to y
subu $t0,r4,$s0 ;compute bound
bne $t0,$zero,loop ;check if done
❑ Vector MIPS code
l.d $f0,a($sp) ;load scalar a
lv $v1,0($s0) ;load vector x
mulvs.d $v2,$v1,$f0 ;vector-scalar multiply
lv $v3,0($s1) ;load vector y
addv.d $v4,$v2,$v3 ;add y to product
sv $v4,0($s1) ;store the result

33
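For reference, the same DAXPY kernel in plain C (a sketch; 64 elements matches the 512-byte loop bound and the 64-element vector registers assumed above):

/* Y = a * X + Y over 64 double-precision elements */
void daxpy(double a, const double *x, double *y)
{
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}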
Vector vs. Scalar
❑ Vector architectures and compilers
l Simplify data-parallel programming
l Explicit statement of absence of loop-carried dependences
- Reduced checking in hardware
l Regular access patterns benefit from interleaved and burst
memory
l Avoid control hazards by avoiding loops

❑ More general than ad-hoc media extensions (such as


MMX, SSE)
l Better match with compiler technology

34
§7.7 Introduction to Graphics Processing Units
7.3. Introduction to GPUs

Early video cards


• Frame buffer memory with address generation for
video output

3D graphics processing
• Originally high-end computers (e.g., SGI)
• Moore’s Law → lower cost, higher density
• 3D graphics cards for PCs and game consoles

Graphics Processing Units


• Processors oriented to 3D graphics tasks
• Vertex/pixel processing, shading, texture mapping,
rasterization

35
Graphics in the System

36
GPU Architectures
❑ Processing is highly data-parallel
l GPUs are highly multithreaded
l Use thread switching to hide memory latency
- Less reliance on multi-level caches
l Graphics memory is wide and high-bandwidth
❑ Trend toward general purpose GPUs
l Heterogeneous CPU/GPU systems
l CPU for sequential code, GPU for parallel code
❑ Programming languages/APIs
l DirectX, OpenGL
l C for Graphics (Cg), High Level Shader Language
(HLSL)
(Heterogeneous: not homogeneous, i.e., mixing different kinds of processors)
l Compute Unified Device Architecture (CUDA)
37
Example: NVIDIA Tesla

Streaming
Multiprocessor

8 × Streaming
processors

38
Example: NVIDIA Tesla
❑ Streaming Processors
l Single-precision FP and integer units
l Each SP is fine-grained multithreaded

❑ Warp: group of 32 threads


l Executed in parallel,
SIMD style
- 8 SPs
× 4 clock cycles
l Hardware contexts
for 24 warps
- Registers, PCs, …

39
§7.8 Introduction to Multiprocessor Network Topologies
Interconnection Networks

❑ Network topologies
l Arrangements of processors, switches, and links

(Diagrams: Bus, Ring, 2D Mesh, N-cube (N = 3), Fully connected)

40
Network Characteristics
❑ Performance
l Latency per message (unloaded network)
l Throughput
- Link bandwidth
- Total network bandwidth
- Bisection bandwidth
l Congestion delays (depending on traffic)

❑ Cost
❑ Power
❑ Routability in silicon

41
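To make the throughput metrics concrete, a hedged sketch using the usual textbook formulas for a ring and a fully connected network (bandwidths in units of one link's bandwidth; p assumed even; the function names are ours):

#include <stdio.h>

/* Ring of p nodes: p links in total; cutting it in half severs 2 links. */
static void ring_bw(int p, double *total, double *bisection)
{
    *total = p;
    *bisection = 2;
}

/* Fully connected network: one link per pair; the half/half cut crosses (p/2)^2 links. */
static void fully_connected_bw(int p, double *total, double *bisection)
{
    *total = p * (p - 1) / 2.0;
    *bisection = (p / 2.0) * (p / 2.0);
}

int main(void)
{
    double t, b;
    ring_bw(64, &t, &b);
    printf("ring (p=64):            total = %.0f, bisection = %.0f\n", t, b);
    fully_connected_bw(64, &t, &b);
    printf("fully connected (p=64): total = %.0f, bisection = %.0f\n", t, b);
    return 0;
}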
§7.13 Concluding Remarks
Concluding Remarks

❑ Goal: higher performance by using multiple


processors
❑ Difficulties
l Developing parallel software
l Devising appropriate architectures

❑ Many reasons for optimism


l Changing software and application environment
l Chip-level multiprocessors with lower latency, higher
bandwidth interconnect

❑ An ongoing challenge for computer architects!

43
