
Chapter 7:

Multicores, Multiprocessors

[with materials from Computer Organization and Design, 4th Edition,


Patterson & Hennessy, © 2008, MK]

1
7.1. Introduction

❑ Goal: connecting multiple computers


to get higher performance
l Multiprocessors
l Scalability, availability, power efficiency

❑ Job-level (process-level) parallelism
l High throughput for independent jobs

❑ Parallel processing program


l Single program run on multiple processors

❑ Multicore microprocessors
l Chips with multiple processors (cores)
2
Types of Parallelism
(Figure: execution timelines contrasting Pipelining, Data-Level Parallelism (DLP), Instruction-Level Parallelism (ILP), and Thread-Level Parallelism (TLP))
3
Hardware and Software

❑ Hardware
l Serial: e.g., Pentium 4
l Parallel: e.g., quad-core Xeon e5345

❑ Software
l Sequential: e.g., matrix multiplication
l Concurrent: e.g., operating system

❑ Sequential/concurrent software can run on


serial/parallel hardware
l Challenge: making effective use of parallel hardware

4
What We’ve Already Covered
❑ §2.11: Parallelism and Instructions
l Synchronization

❑ §3.6: Parallelism and Computer Arithmetic


l Associativity

❑ §4.10: Parallelism and Advanced Instruction-Level


Parallelism
❑ §5.8: Parallelism and Memory Hierarchies
l Cache Coherence

❑ §6.9: Parallelism and I/O:


l Redundant Arrays of Inexpensive Disks

5
Parallel Programming
❑ Parallel software is the problem
❑ Need to get significant performance improvement
l Otherwise, just use a faster uniprocessor, since it’s easier!

❑ Difficulties
l Partitioning
l Coordination
l Communications overhead

6
Amdahl’s Law
❑ Sequential part can limit speedup
❑ Example: 100 processors, 90× speedup?
l Told = Tparallelizable + Tsequential
l Tnew = Tparallelizable/100 + Tsequential
l Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
l Solving: Fparallelizable = 0.999

❑ Need sequential part to be 0.1% of original time

7
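A minimal C sketch of this calculation (the helper name amdahl_speedup is ours, not from the slides): it evaluates the speedup formula above and shows that Fparallelizable = 0.999 on 100 processors gives roughly the 90× target.

#include <stdio.h>

/* Amdahl's Law: speedup on n processors when a fraction f of the
   original execution time is parallelizable. */
static double amdahl_speedup(double f, int n)
{
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void)
{
    /* Slide example: 100 processors, parallel fraction 0.999 */
    printf("Speedup = %.1f\n", amdahl_speedup(0.999, 100)); /* prints about 91 */
    return 0;
}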
Scaling Example
❑ Workload: sum of 10 scalars, and 10 × 10 matrix
sum
l Speed up from 10 to 100 processors
❑ Single processor: Time = (10 + 100) × tadd
❑ 10 processors
l Time = 10 × tadd + 100/10 × tadd = 20 × tadd
l Speedup = 110/20 = 5.5 (5.5/10 = 55% of potential)
❑ 100 processors
l Time = 10 × tadd + 100/100 × tadd = 11 × tadd
l Speedup = 110/11 = 10 (10/100 = 10% of potential)
❑ Assumes load can be balanced across processors
8
Scaling Example (cont)

❑ What if matrix size is 100 × 100?


❑ Single processor: Time = (10 + 10000) × tadd
❑ 10 processors
l Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
l Speedup = 10010/1010 = 9.9 (99% of potential)
❑ 100 processors
l Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
l Speedup = 10010/110 = 91 (91% of potential)

❑ Assuming load balanced

9
Strong vs Weak Scaling

❑ Strong scaling: problem size fixed


l As in example

❑ Weak scaling: problem size proportional to number of processors
l 10 processors, 10 × 10 matrix
- Time = 20 × tadd
l 100 processors, 32 × 32 matrix
- Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
l Constant performance in this example

10
7.2. Shared Memory Multiprocessor

❑ SMP: shared memory multiprocessor


l Hardware provides single physical
address space for all processors
l Synchronize shared variables using locks
l Usually adopted in general-purpose CPUs in laptops and desktops

❑ Memory access time: UMA vs NUMA

11
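As a hedged illustration of synchronizing a shared variable with a lock on an SMP (the slides do not prescribe an API; POSIX threads is just one common choice), four threads updating one counter in the shared address space might look like this:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static long shared_sum = 0;                 /* lives in the single shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long local = (long)(intptr_t)arg;       /* each thread's private contribution */
    pthread_mutex_lock(&lock);              /* lock serializes updates to the shared variable */
    shared_sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)(i + 1));
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("shared_sum = %ld\n", shared_sum); /* 1+2+3+4 = 10 */
    return 0;
}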
Shared Memory Arch: UMA

❑ Access time to a memory location is independent of which processor makes the request, or which memory chip contains the transferred data
❑ Used for a few processors

(Diagram: Intel's FSB-based UMA architecture: processors (P) with caches (C) share memory (M) over a single front-side bus, which becomes a source of bus contention)

12
Shared Memory Arch: NUMA

❑ Access time depends on the memory location relative to a processor
❑ Used for dozens or hundreds of processors
❑ Processors use the same memory address space (Distributed Shared Memory, DSM)
❑ Intel QPI competes with AMD HyperTransport; both are point-to-point links rather than buses
(Diagram: QuickPath Interconnect architecture: each processor (P) with its cache (C) has an embedded memory controller directly attached to local memory (M), and processors are linked point-to-point rather than over a shared bus)

13
Shared Memory Arch: NUMA

❑ E.g., the memory managers of programming languages also need to be NUMA-aware; Java is NUMA-aware
❑ E.g., Oracle 11g is explicitly enabled for NUMA support
❑ E.g., Windows XP SP2, Server 2003, and Vista support NUMA

(Diagram: two NUMA nodes, each an SMP with processors (P), caches (C), and local memory (M) on its own bus, connected to form one shared address space)

14
Example: Sun Fire V210 / V240 Mainboard

15
Example: Dell PowerEdge R720

16
Example: Sum Reduction

❑ Sum 100,000 numbers on 100 processor UMA


l Each processor has ID: 0 ≤ Pn ≤ 99
l Partition 1000 numbers per processor
l Initial summation on each processor
100,000 numbers

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i++)
    sum[Pn] = sum[Pn] + A[i];

❑ Now need to add these partial sums


l Reduction: divide and conquer
l Half the processors add pairs, then quarter, …
l Need to synchronize between reduction steps
17
Example: Sum Reduction

(Diagram: tree reduction with 4 processors (0–3), and with 7 processors (0–6) where half is odd)

half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1); /* exit with final sum in sum[0] */
18
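Putting the two fragments together, one plausible C rendering of the whole UMA reduction (a sketch only: NPROCS, N, the shared arrays, and the pthread barrier standing in for synch() are assumptions about the surrounding setup):

#include <pthread.h>

#define NPROCS 100
#define N      100000

double A[N];                /* shared input data */
double sum[NPROCS];         /* shared partial sums */
pthread_barrier_t barrier;  /* assumed initialized elsewhere for NPROCS threads */

/* Run by every processor/thread; Pn is this thread's ID, 0..NPROCS-1. */
void parallel_sum(int Pn)
{
    /* Phase 1: each processor sums its own 1000-element slice. */
    sum[Pn] = 0.0;
    for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
        sum[Pn] = sum[Pn] + A[i];

    /* Phase 2: tree reduction, halving the active processors each step. */
    int half = NPROCS;
    do {
        pthread_barrier_wait(&barrier);      /* the slides' synch() */
        if (half % 2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half - 1]; /* odd case: P0 picks up the extra element */
        half = half / 2;                     /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn + half];
    } while (half > 1);
    /* On exit, sum[0] holds the total. */
}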
Message Passing

❑ Each processor has private physical address space


❑ Hardware sends/receives messages between processors

19
Loosely Coupled Clusters

❑ Network of independent computers


l Each has private memory and OS
l Connected using I/O system
- E.g., Ethernet/switch, Internet

❑ Suitable for applications with independent tasks


l Web servers, databases, simulations, …

❑ High availability, scalable, affordable


❑ Problems
l Administration cost (prefer virtual machines)
l Low interconnect bandwidth
- c.f. processor/memory bandwidth on an SMP
20
Sum Reduction (Again)
❑ Sum 100,000 numbers on 100 processors
❑ First distribute 1000 numbers to each
l Then do partial sums

1,000 numbers

sum = 0;
for (i = 0; i < 1000; i++)
    sum = sum + AN[i];

❑ Reduction
l Half the processors send, the other half receive and add
l Then a quarter send, a quarter receive and add, …

21
Sum Reduction (Again)

❑ Given send() and receive() operations


limit = 100; half = 100; /* 100 processors */
repeat
    half = (half+1)/2;    /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit/2))
        sum = sum + receive();
    limit = half;         /* upper limit of senders */
until (half == 1);        /* exit with final sum */

(Diagram: reduction among 7 processors, 0–6, starting from half = 7)

l Send/receive also provide synchronization


l Assumes send/receive take similar time to addition

22
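On a real cluster this send/receive pattern is usually expressed through a message-passing library; a hedged MPI sketch of the same reduction (assuming each rank fills its own 1000 local values, here set to 1.0 as placeholder data) might be:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int Pn, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank sums its own private numbers. */
    double AN[1000], local_sum = 0.0;
    for (int i = 0; i < 1000; i++)
        AN[i] = 1.0;                        /* placeholder local data */
    for (int i = 0; i < 1000; i++)
        local_sum = local_sum + AN[i];

    /* MPI_Reduce performs the send/receive tree reduction internally. */
    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (Pn == 0)
        printf("total = %f on %d processes\n", total, nprocs);

    MPI_Finalize();
    return 0;
}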
Message Passing Systems
❑ Message passing is a type of communication between processes or objects in computer science
❑ Examples: ONC RPC, CORBA, Java RMI, DCOM, SOAP, .NET Remoting, CTOS, QNX Neutrino RTOS, OpenBinder, D-Bus, Unison RTOS
❑ Message passing systems have been called "shared nothing" systems (each participant is a black box)
❑ Open source clusters: Beowulf, Microwulf

(Photos: Microwulf and Beowulf message-passing clusters)


23
§7.6 SISD, MIMD, SIMD, SPMD, and Vector
Instruction and Data Streams

❑ An alternate classification

                            Data Streams
                            Single                    Multiple
Instruction    Single       SISD:                     SIMD:
Streams                     Intel Pentium 4           SSE instructions of x86
               Multiple     MISD:                     MIMD:
                            No examples today         Intel Xeon e5345

◼ SPMD: Single Program Multiple Data


◼ A parallel program on a MIMD computer
◼ Conditional code for different processors

24
Single Instruction, Single Data

❑ Single Instruction: Only one


instruction stream is being acted on
by the CPU during any one clock
cycle
❑ Single Data: Only one data stream is being used as input during any one clock cycle
❑ Deterministic execution
❑ Examples: older generation
mainframes, minicomputers and
workstations; most modern day PCs.
25
Single Instruction, Single Data

(Photos: UNIVAC 1, IBM 360, CRAY 1, CDC 7600, PDP 1, laptop)

26
Multi Instruction, Multi Data
◼ Multiple Instruction: Every
processor may be executing a
different instruction stream
◼ Multiple Data: Every processor
may be working with a different
data stream
❑ Execution can be synchronous or asynchronous, deterministic or
non-deterministic
❑ Currently, the most common type of parallel computer - most
modern supercomputers fall into this category.
❑ Examples: most current supercomputers, networked parallel
computer clusters and "grids", multi-processor SMP computers,
multi-core PCs.
❑ Note: many MIMD architectures also include SIMD execution sub-
components
27
Multi Instruction, Multi Data

(Photos: IBM POWER5, Intel IA32, HP/Compaq AlphaServer, AMD Opteron, Cray XT3, IBM BG/L)

28
Single Instruction, Multiple Data
❑ Operate elementwise on vectors of data
l E.g., MMX and SSE instructions in x86
- Multiple data elements in 128-bit wide registers

❑ All processors execute the same instruction at the same time
l Each with different data address, etc.

❑ Reduced instruction control hardware
❑ Works best for highly data-parallel applications with a high degree of regularity, such as graphics/image processing
29
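As a small hedged illustration of the SSE idea above (four single-precision elements per 128-bit register; the function is ours, not from the slides):

#include <xmmintrin.h>  /* SSE intrinsics */

/* Elementwise c[i] = a[i] + b[i], four lanes per instruction (n assumed a multiple of 4). */
void add_arrays_sse(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats into a 128-bit register */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* one SIMD add covers all 4 lanes */
        _mm_storeu_ps(&c[i], vc);
    }
}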
Single Instruction, Multiple Data

❑ Synchronous (lockstep) and deterministic execution


❑ Two varieties: Processor Arrays and Vector Pipelines
❑ Most modern computers, particularly those with
graphics processor units (GPUs) employ SIMD
instructions and execution units.

30
Single Instruction, Multiple Data

(Photos: ILLIAC IV, MasPar, CRAY X-MP, Cray Y-MP, Thinking Machines CM-2, GPU)

31
Vector Processors

❑ Highly pipelined function units


❑ Stream data from/to vector registers to units
l Data collected from memory into registers
l Results stored from registers to memory

❑ Example: Vector extension to MIPS


l 32 × 64-element registers (64-bit elements)
l Vector instructions
- lv, sv: load/store vector
- addv.d: add vectors of double
- addvs.d: add scalar to each element of vector of double

❑ Significantly reduces instruction-fetch bandwidth


32
Example: DAXPY (Y = a × X + Y)
❑ Conventional MIPS code
l.d $f0,a($sp) ;load scalar a
addiu r4,$s0,#512 ;upper bound of what to load
loop: l.d $f2,0($s0) ;load x(i)
mul.d $f2,$f2,$f0 ;a × x(i)
l.d $f4,0($s1) ;load y(i)
add.d $f4,$f4,$f2 ;a × x(i) + y(i)
s.d $f4,0($s1) ;store into y(i)
addiu $s0,$s0,#8 ;increment index to x
addiu $s1,$s1,#8 ;increment index to y
subu $t0,r4,$s0 ;compute bound
bne $t0,$zero,loop ;check if done
❑ Vector MIPS code
l.d $f0,a($sp) ;load scalar a
lv $v1,0($s0) ;load vector x
mulvs.d $v2,$v1,$f0 ;vector-scalar multiply
lv $v3,0($s1) ;load vector y
addv.d $v4,$v2,$v3 ;add y to product
sv $v4,0($s1) ;store the result

33
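For reference, the same DAXPY kernel in plain C (a sketch; 64 elements matches the 512-byte loop bound and the 64-element vector registers assumed above):

/* Y = a * X + Y over 64 double-precision elements */
void daxpy(double a, const double *x, double *y)
{
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}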
Vector vs. Scalar
❑ Vector architectures and compilers
l Simplify data-parallel programming
l Explicit statement of absence of loop-carried dependences
- Reduced checking in hardware
l Regular access patterns benefit from interleaved and burst
memory
l Avoid control hazards by avoiding loops

❑ More general than ad-hoc media extensions (such as


MMX, SSE)
l Better match with compiler technology

34
§7.7 Introduction to Graphics Processing Units
7.3. Introduction to GPUs

Early video cards


• Frame buffer memory with address generation for
video output

3D graphics processing
• Originally high-end computers (e.g., SGI)
• Moore’s Law → lower cost, higher density
• 3D graphics cards for PCs and game consoles

Graphics Processing Units


• Processors oriented to 3D graphics tasks
• Vertex/pixel processing, shading, texture mapping,
rasterization

35
Graphics in the System

36
GPU Architectures
❑ Processing is highly data-parallel
l GPUs are highly multithreaded
l Use thread switching to hide memory latency
- Less reliance on multi-level caches
l Graphics memory is wide and high-bandwidth
❑ Trend toward general purpose GPUs
l Heterogeneous CPU/GPU systems
l CPU for sequential code, GPU for parallel code
❑ Programming languages/APIs
l DirectX, OpenGL
l C for Graphics (Cg), High Level Shader Language
(HLSL)
(Heterogeneous: not homogeneous, i.e., mixing different kinds of processors)
l Compute Unified Device Architecture (CUDA)
37
Example: NVIDIA Tesla

Streaming
Multiprocessor

8 × Streaming
processors

38
Example: NVIDIA Tesla
❑ Streaming Processors
l Single-precision FP and integer units
l Each SP is fine-grained multithreaded

❑ Warp: group of 32 threads


l Executed in parallel,
SIMD style
- 8 SPs
× 4 clock cycles
l Hardware contexts
for 24 warps
- Registers, PCs, …

39
§7.8 Introduction to Multiprocessor Network Topologies
Interconnection Networks

❑ Network topologies
l Arrangements of processors, switches, and links

(Diagrams: Bus, Ring, 2D Mesh, N-cube (N = 3), Fully connected)

40
Network Characteristics
❑ Performance
l Latency per message (unloaded network)
l Throughput
- Link bandwidth
- Total network bandwidth
- Bisection bandwidth
l Congestion delays (depending on traffic)

❑ Cost
❑ Power
❑ Routability in silicon

41
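To make the throughput metrics concrete, a hedged sketch using the usual textbook formulas for a ring and a fully connected network (bandwidths in units of one link's bandwidth; p assumed even; the function names are ours):

#include <stdio.h>

/* Ring of p nodes: p links in total; cutting it in half severs 2 links. */
static void ring_bw(int p, double *total, double *bisection)
{
    *total = p;
    *bisection = 2;
}

/* Fully connected network: one link per pair; the half/half cut crosses (p/2)^2 links. */
static void fully_connected_bw(int p, double *total, double *bisection)
{
    *total = p * (p - 1) / 2.0;
    *bisection = (p / 2.0) * (p / 2.0);
}

int main(void)
{
    double t, b;
    ring_bw(64, &t, &b);
    printf("ring (p=64):            total = %.0f, bisection = %.0f\n", t, b);
    fully_connected_bw(64, &t, &b);
    printf("fully connected (p=64): total = %.0f, bisection = %.0f\n", t, b);
    return 0;
}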
§7.13 Concluding Remarks
Concluding Remarks

❑ Goal: higher performance by using multiple


processors
❑ Difficulties
l Developing parallel software
l Devising appropriate architectures

❑ Many reasons for optimism


l Changing software and application environment
l Chip-level multiprocessors with lower latency, higher
bandwidth interconnect

❑ An ongoing challenge for computer architects!

43
