
COMPUTER ORGANIZATION AND DESIGN
6th Edition
The Hardware/Software Interface

Chapter 6
Parallel Processors from
Client to Cloud
PART 1

Chapter 6 — Parallel Processors from Client to Cloud — 2


§6.1 Introduction
Introduction
 Goal: connecting multiple computers
to get higher performance
 Multiprocessors
 Scalability, availability, power efficiency
 Task-level (process-level) parallelism
 High throughput for independent jobs
 Parallel processing program
 Single program run on multiple processors
 Multicore microprocessors
 Chips with multiple processors (cores)
Chapter 6 — Parallel Processors from Client to Cloud — 3
Hardware and Software
 Hardware
 Serial: executes one instruction at a time, e.g., Pentium 4
 Parallel: can perform multiple instructions simultaneously
using multiple cores, e.g., quad-core Xeon e5345
 Software
 Sequential: Tasks are completed one after another in a fixed order.
e.g., matrix multiplication
 Concurrent: Multiple tasks or processes are handled at the same
time. e.g., operating system
 Sequential/concurrent software can run on
serial/parallel hardware
 Challenge: optimizing software to take full advantage of parallel
hardware, which often requires careful programming and resource
management.
What We’ve Already Covered
 §2.11: Parallelism and Instructions
 Synchronization: ensures correct execution when multiple
threads access shared resources.
 §3.6: Parallelism and Computer Arithmetic
 Subword Parallelism: breaks data into smaller parts (like
bytes or words) and processes them simultaneously within a single
instruction.
 §4.11: Parallelism via Instructions: executing
multiple machine instructions at the same time to increase performance.
 §5.10: Parallelism and Memory Hierarchies
 Cache Coherence

Chapter 6 — Parallel Processors from Client to Cloud — 5


§6.2 The Difficulty of Creating Parallel Processing Programs
Parallel Programming
 Parallel software is the problem
 While hardware supports parallelism, writing efficient parallel
software is difficult. It's often the main bottleneck to achieving
performance gains.
 Need to get significant performance improvement
 Otherwise, just use a faster uniprocessor, since it’s
easier!
 Difficulties
 Partitioning: Breaking the task into smaller parts that can be run in
parallel.
 Coordination: Managing the flow and dependencies between
parallel tasks.
 Communications overhead: Time and resources spent transferring
data between parallel tasks can reduce performance gains.
Amdahl’s Law
Amdahl’s Law shows how much performance
improvement (speedup) you can expect from using
multiple processors — depending on how much of the
program can be parallelized.

Example: 100 processors, 90× speedup?


 If we use 100 processors, is it possible to make the program run 90
times faster than with 1 processor?
 100 processors: We are running the program in parallel across 100
CPU cores.
 90× speedup: We hope the program completes 90 times faster than it
would on a single processor.

Chapter 6 — Parallel Processors from Client to Cloud — 7


Amdahl’s Law
 Tnew = Tparallelizable/100 + Tsequential
This equation calculates the new execution time when using 100
processors.
 Tparallelizable/100: the parallel portion is divided across
100 processors (so it runs faster).
 Tsequential: the sequential part must still run on one
processor and can’t be sped up.
 The maximum speedup we can achieve:

    Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
• F_parallelizable: Fraction of the program that can be parallelized.
• 1- F_parallelizable: The remaining sequential part.
• 100: The number of processors.
Speedup = 90: This is the goal — to make the program run 90 times faster than it
would on a single processor.
Amdahl’s Law

 Desired speedup = 90× with 100 processors.
 Solving the equation gives Fparallelizable = 0.999,
meaning 99.9% of the program must be
parallelizable.
 Therefore, the sequential part can only be 0.1%
of the original execution time to achieve such a
high speedup.
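A worked sketch of the algebra behind that number, using the speedup formula above (writing F for Fparallelizable):

\[
90 = \frac{1}{(1 - F) + F/100}
\;\Rightarrow\; 1 - 0.99\,F = \tfrac{1}{90} \approx 0.0111
\;\Rightarrow\; F = \frac{1 - 0.0111}{0.99} \approx 0.999
\]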
Summary:
No matter how many processors you add, you can't
speed up the sequential part, so it becomes the
bottleneck.
Chapter 6 — Parallel Processors from Client to Cloud — 9
Scaling Example
In this context, scaling refers to how well a computational
task performs as you increase the number of processors.
1. The Task (Workload):

Add 10 scalar numbers

Add all elements in a 10 × 10 matrix (which is 100 numbers)
Total work = 10 + 100 = 110 additions
2. Time on 1 Processor:
 Processor does everything by itself.
 Total time = 110 × tadd
3. Time on 10 Processors:
 The 10 scalars must be added by one processor (can't split them
easily), so Time = 10 × tadd
 The 100 matrix elements are evenly divided among 10 processors:
each does 10 adds → 10 × tadd, so Time = 10 × tadd
 Total time = 10 (scalars) + 10 (matrix) = 20 × tadd
Scaling Example - cont
4. Speedup with 10 Processors:
Speedup = Time on 1 processor / Time on 10 processors = 110 / 20 = 5.5,
so only 55% of the ideal speedup (ideal = 10×)
5. Time on 100 Processors:
 Matrix split among 100 processors: each does 1 add → 1 × tadd
 Total time = 10 + 1 = 11 × tadd
6. Speedup with 100 Processors:
 Speedup = 110 / 11 = 10, so only 10% of the ideal speedup (ideal =
100×)
Final Point:

The serial part (scalar adds) becomes a bottleneck as processors
increase.

This shows scaling is limited by non-parallel work — an idea from
Amdahl’s Law.
Scaling Example (cont)
 What if matrix size is 100 × 100?
 Single processor: Time = (10 + 10000) × tadd
 10 processors
 Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
 Speedup = 10010/1010 = 9.9 (99% of potential)
 100 processors
 Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
 Speedup = 10010/110 = 91 (91% of potential)


 Assuming load balanced

Conclusion:
 As matrix size increases, the parallelizable work dominates the total.
 This means the serial part (10 scalar additions) becomes less significant.
 So, the speedup gets closer to ideal as the problem size grows — this is
known as better scalability.
 Key insight: Large workloads benefit more from parallelism.
Chapter 6 — Parallel Processors from Client to Cloud — 12
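The speedups worked out in the last three slides can be reproduced with a short helper; a minimal sketch in C (the function and the values in main() mirror the example above; all names are illustrative):

#include <stdio.h>

/* Speedup for the scaling example: serial_adds cannot be split (the 10
   scalar additions); parallel_adds (the matrix) are divided among p
   processors, assuming perfect load balance. Times are in units of t_add. */
static double speedup(int serial_adds, int parallel_adds, int p) {
    double t1 = serial_adds + parallel_adds;              /* 1 processor  */
    double tp = serial_adds + (double)parallel_adds / p;  /* p processors */
    return t1 / tp;
}

int main(void) {
    printf("10x10 matrix,    10 procs: %.1f\n", speedup(10, 100, 10));    /* 5.5  */
    printf("10x10 matrix,   100 procs: %.1f\n", speedup(10, 100, 100));   /* 10.0 */
    printf("100x100 matrix,  10 procs: %.1f\n", speedup(10, 10000, 10));  /* 9.9  */
    printf("100x100 matrix, 100 procs: %.1f\n", speedup(10, 10000, 100)); /* 91.0 */
    return 0;
}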
Strong vs Weak Scaling
 Strong Scaling:
 Fixed problem size, increase processors
 Goal: reduce time as you add processors
 Example: Original 10 scalars + matrix work
 Weak Scaling:
 Problem size grows with number of processors
 Goal: keep time constant
 Example:

10 processors, 10×10 matrix → Time = 20 × tadd

100 processors, 32×32 matrix (≈ 1000 elements) → Time = 20 × tadd

Conclusion:
 In weak scaling, if load is balanced well, the performance stays
constant as we scale up the problem and processors.

Chapter 6 — Parallel Processors from Client to Cloud — 13


§6.3 SISD, MIMD, SIMD, SPMD, and Vector
Instruction and Data Streams
 An alternate classification
                              Data Streams
                              Single                    Multiple
Instruction     Single        SISD:                     SIMD:
Streams                       Intel Pentium 4           SSE instructions of x86
                Multiple      MISD:                     MIMD:
                              No examples today         Intel Xeon e5345
 SPMD: Single Program Multiple Data
 One program runs on multiple processors
 Each processor gets different data
 Processors may do different tasks using conditional statements (e.g., if
statements) inside the same program
 So, it's not multiple programs — just one program that behaves differently
on each processor.
Chapter 6 — Parallel Processors from Client to Cloud — 14
Vector Processors
A vector processor is a type of CPU designed to handle
vector operations — that means it can process many data
elements at once, instead of one at a time.

Example:
 A regular processor (scalar processor) adds two numbers
like this:
    5 + 3 = 8
 A vector processor can add two lists (vectors) of numbers
like this, all at once:
    [5, 10, 15] + [3, 6, 9] = [8, 16, 24]

Chapter 6 — Parallel Processors from Client to Cloud — 15


Vector Processors
 Highly pipelined function units- Like an assembly line — different
stages of a calculation happen at the same time, making it faster.
 Stream data from/to vector registers to units
 Data collected from memory into registers

 Results stored from registers to memory

 Example: Vector extension to MIPS


 32 × 64-element registers (64-bit elements)

 Vector instructions allow working on many values at once


lv, sv: load/store vector

addv.d: add vectors of double

addvs.d: add scalar to each element of vector of double
 Significantly reduces instruction-fetch bandwidth - fewer
instructions are needed since one vector instruction can handle many data
items at once — saving time and memory use.
Chapter 6 — Parallel Processors from Client to Cloud — 16
Example: DAXPY (Y = a × X + Y)
 Conventional MIPS code
l.d $f0,a($sp) ;load scalar a
addiu r4,$s0,#512 ;upper bound of what to load
loop: l.d $f2,0($s0) ;load x(i)
mul.d $f2,$f2,$f0 ;a × x(i)
l.d $f4,0($s1) ;load y(i)
add.d $f4,$f4,$f2 ;a × x(i) + y(i)
s.d $f4,0($s1) ;store into y(i)
addiu $s0,$s0,#8 ;increment index to x
addiu $s1,$s1,#8 ;increment index to y
subu $t0,r4,$s0 ;compute bound
bne $t0,$zero,loop ;check if done
 Vector MIPS code
l.d     $f0,a($sp)    ;load scalar a
lv      $v1,0($s0)    ;load vector x
mulvs.d $v2,$v1,$f0   ;vector-scalar multiply
lv      $v3,0($s1)    ;load vector y
addv.d  $v4,$v2,$v3   ;add y to product
sv      $v4,0($s1)    ;store the result
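For reference, a minimal C sketch of the computation both code sequences above implement, one element per loop iteration (the function name is illustrative; n is 64 for the 512-byte, 64-element example above):

void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* Y = a × X + Y */
}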

Chapter 6 — Parallel Processors from Client to Cloud — 17


Vector vs. Scalar
 Scalar processing handles one data item at a time.
 Vector processing handles multiple data items (a vector) at once — this is
great for tasks that process arrays or lists of data (like images or simulations).
Advantages of Vector Architectures:
1. Simplify data-parallel programming
 Easier to write code that does the same operation on many pieces of data at once.
2. Explicitly state there are no loop-carried dependencies
 The programmer/compiler tells the system that one loop step doesn’t depend on the
previous.
 This reduces the amount of checking the hardware must do.
3. Efficient memory access
 Vectors access memory in regular patterns, which allows the use of interleaved or
burst memory, making data transfer faster.
4. Avoid control hazards
 Since loops can be replaced by vector operations, the risk of branching errors or
delays is reduced.
5. More general than MMX/SSE (older media extensions)
 Vector architectures are more flexible and work better with modern compilers than
these older, special-purpose systems.
SIMD
SIMD stands for Single Instruction, Multiple Data.
This is a way of processing where the same instruction (command) is applied to
many pieces of data at the same time — instead of one at a time.
Example:
If you have a list of numbers, and you want to add 5 to each number.
Without SIMD (scalar processing):

The computer adds 5 to each number one by one.


With SIMD:

Then computer adds 5 to many numbers at once using a single instruction.


So Instead of:
Add 5 to number 1, Add 5 to number 2, Add 5 to number 3 ...
We do:
Add 5 to numbers 1–4 at the same time

Benefits of SIMD:
Faster performance (less time to process big data sets)

Simpler code for repeated tasks

Less power and hardware usage compared to doing everything one by one
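A minimal sketch of the "add 5 to four numbers at once" idea using x86 SSE2 intrinsics (128-bit registers holding four 32-bit integers); the function name and the multiple-of-4 length are assumptions for illustration:

#include <emmintrin.h>   /* SSE2 intrinsics */

void add5(int *data, int n) {          /* assumes n is a multiple of 4 */
    __m128i five = _mm_set1_epi32(5);  /* {5, 5, 5, 5} */
    for (int i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)(data + i));  /* load 4 ints    */
        v = _mm_add_epi32(v, five);                          /* 4 adds at once */
        _mm_storeu_si128((__m128i *)(data + i), v);          /* store 4 ints   */
    }
}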
SIMD – Cont.
Key Points:
Operate on vectors (lists of data), not just one number at a time

Example: MMX and SSE let CPUs work on multiple values at once using
128-bit registers (like doing math on 4 numbers in one step).
All processors run the same instruction at the same time,

but each may work on different pieces of data (e.g., different memory
locations).
Easier to synchronize

Since everyone is doing the same thing, it's simpler to keep everything in
sync.
Less hardware control is needed

Only one instruction is sent to all processors — this simplifies the system.
Best for data-parallel tasks

Works really well when doing the same operation on lots of data, like in:

Image processing

Audio/video

Scientific simulations
SIMD – Cont.
MMX (MultiMedia eXtensions)
Introduced by Intel in the 1990s.
It added special instructions to the CPU to speed up multimedia tasks, like:


Image processing

Video playback

Audio compression
What it does:
It allows the CPU to process multiple pieces of data at once (SIMD = Single
Instruction, Multiple Data). For example, it can apply the same math operation to 4
pixels at the same time.
SSE (Streaming SIMD Extensions)
Also developed by Intel, as an improvement over MMX.
SSE is faster and more flexible.
It supports floating-point numbers, which MMX did not (important for things like 3D

graphics and physics simulations).


What it adds:
More instructions for math, logic, and memory operations
Ability to handle 128-bit registers, meaning it can process even more data in parallel
Vector vs. Multimedia Extensions
1. Vector instructions have a variable vector width, multimedia
extensions have a fixed width
 Vector instructions can work with different lengths of data (e.g., 4, 8,

or 16 elements).
 Multimedia extensions like MMX/SSE always use a fixed size, such

as 128 bits.
This makes vector architectures more flexible for different data sizes.

2. Vector instructions support strided access, multimedia extensions


do not
 Strided access means you can access elements in memory with a gap (stride)
between them.
Example: Access every 3rd number in a list.
 Multimedia extensions can only access continuous data.

Vector systems can handle more complex memory patterns.


Vector vs. Multimedia Extensions
3. Vector hardware can be built using:

Pipelining: breaking operations into stages (like an assembly line).

Arrayed units: doing multiple operations in parallel (like multiple workers
doing the same task).
Left Side (Element Groups):
 Shows vector operations where multiple elements (e.g., A[i] + B[i]) are processed.
 Data is grouped and processed in parallel, forming output vectors like C[0], C[1], ....
Right Side (Vector Hardware Layout):
 The vector processor is divided into lanes (Lane 0 to Lane 3, four lanes).
 Each lane handles a different group of vector elements.
 Inside each lane, there are:

Pipelined units for addition and multiplication (e.g., FP add pipe 0, FP mul pipe 0)

Vector registers storing specific element positions (e.g., Lane 0 handles elements 0, 4, 8...)
§6.4 Hardware Multithreading
Multithreading
What is a Thread?
A thread is the smallest unit of work that a computer can run inside a
program.
It’s like a mini-task that helps a program do multiple things at once.
Example:
Imagine a web browser:
One thread loads a web page

Another thread plays a video

Another listens for your mouse or keyboard input

All of these can happen at the same time—thanks to threads.

"Think of a thread like a worker in a company. If a program is a company,


threads are the workers doing different jobs at the same time to get things
done faster."
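A minimal sketch of that idea with POSIX threads: one program starting several workers that run at the same time (the worker function and thread count are illustrative):

#include <pthread.h>
#include <stdio.h>

void *worker(void *arg) {
    printf("worker %ld doing its job\n", (long)arg);   /* each thread's mini-task */
    return NULL;
}

int main(void) {
    pthread_t t[3];
    for (long i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);  /* start 3 workers */
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);                        /* wait for all of them */
    return 0;
}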

Chapter 6 — Parallel Processors from Client to Cloud — 24


§6.4 Hardware Multithreading
Multithreading
 Performing multiple threads of execution in
parallel
 Replicate registers, PC, etc.
 Fast switching between threads
 Fine-grain multithreading
 Switch threads after each cycle
 Interleave instruction execution
 If one thread stalls, others are executed
 Coarse-grain multithreading
 Only switch on long stall (e.g., L2-cache miss)
 Simplifies hardware, but doesn’t hide short stalls
(e.g., data hazards)

Chapter 6 — Parallel Processors from Client to Cloud — 25


Simultaneous Multithreading
Multiple-Issue Dynamically Scheduled Processor
This type of processor can run more than one instruction at the same

time.
It can choose instructions from different threads, depending on which

resources (like ALUs) are free.


If instructions from different threads don’t depend on each other, they

can run in parallel.


Inside a thread, the processor uses scheduling and register renaming to

avoid problems caused by instruction dependencies.

Example: Intel Pentium-4 HT (Hyper-Threading)


It can run two threads at once.

Each thread has its own registers, but shares function units and cache

with the other thread.


This helps use the CPU more efficiently when one thread is waiting or

stalled.

Chapter 6 — Parallel Processors from Client to Cloud — 26


Multithreading Example
• Each thread (A, B, C, or D) is running on its own
— not sharing the processor with other threads at
the same time.
• The processor is working on only one thread at a
time.
• It doesn't switch quickly between threads.
• It doesn't combine instructions from different
threads.

Coarse-Grain Multithreading (Coarse MT):


•Runs one thread for a while, then switches to another
only if there's a long delay (e.g., memory wait).
•Big blocks of the same color, meaning one thread runs
for longer before switching.
Fine-Grain Multithreading (Fine MT):
•Switches threads every cycle.
•Instructions from different threads are interleaved
(mixed), improving CPU use.
Simultaneous Multithreading (SMT):
•Runs instructions from multiple threads at the same
time in the same cycle.
•Most efficient — uses all available issue slots by
combining instructions from many threads.
Future of Multithreading
 Will multithreading continue? If so, how might
it change?
 Power limits are pushing designers to use
simpler processor designs
 This leads to basic versions of multithreading
 To handle delays from cache misses,
 Switching threads might be the best solution
 Using many simple cores could help share
resources better

Chapter 6 — Parallel Processors from Client to Cloud — 28


§6.5 Multicore and Other Shared Memory Multiprocessors
Shared Memory
 SMP: shared memory multiprocessor

In an SMP system, all processors share the same memory.

The hardware gives one shared address space, so all CPUs see the same memory.

To prevent problems when many CPUs access the same variable, we use locks to
synchronize them.

Memory access time can be:

UMA (Uniform): All processors access memory at the same speed.

NUMA (Non-Uniform): Some processors can access certain memory faster than others (based on
location).

• Processors: Multiple CPUs are working


in parallel.
• Each processor has its own cache to
store frequently used data.
• All processors are connected through an
interconnection network (like a bus or
switch).
• This network links the CPUs to shared
memory and I/O devices.
“Think of each processor as a worker with their own desk
(cache), but they all go to the same filing cabinet (memory)
through the same hallway (interconnection network).”
§6.5 Multicore and Other Shared Memory Multiprocessors
Why Use Parallelism in Shared Memory Systems?

 Shared memory multiprocessors like UMA systems allow


multiple processors to work together efficiently.
 Suppose we need to sum 64,000 numbers — doing this on
one processor would be slow.
 With 64 processors, we can divide the work so each
processor handles 1,000 numbers in parallel.
 Each processor computes a partial sum.
 We then use a technique called reduction to combine the
partial results into a final total.
 This shows how parallel programming and shared
memory can greatly speed up computations.
Example: Sum Reduction
 Goal: Sum 64,000 numbers using 64 processors
 Step 1: Split the Work
 Each processor has an ID: 0 ≤ Pn ≤ 63
 Divide data evenly: 1000 numbers per processor
 Step 2: Local Summation (on each processor)
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
    sum[Pn] += A[i];
 Each processor computes a partial sum of its portion.
 Step 3: Combine the Results (Reduction)
 Use a divide and conquer approach:

First, half the processors add pairs of results.

Then, a quarter of them add those results, and so on.
 Synchronization is needed between reduction steps.

Chapter 6 — Parallel Processors from Client to Cloud — 31


Example: Sum Reduction

half = 64;
do
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] += sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2;   /* dividing line on who sums */
    if (Pn < half) sum[Pn] += sum[Pn+half];
while (half > 1);
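A minimal runnable sketch of the same two steps with OpenMP, using a barrier in place of synch() and threads in place of the 64 processors (the thread count, sample data, and names are assumptions for illustration; compile with -fopenmp, and note it relies on the runtime granting all 64 threads):

#include <omp.h>
#include <stdio.h>

#define P 64                 /* "processors" (threads) */
#define N 64000

double A[N], sum[P];

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1.0;   /* sample data: total should be 64000 */

    #pragma omp parallel num_threads(P)
    {
        int Pn = omp_get_thread_num();

        /* Step 2: local partial sum of this thread's 1000 numbers */
        sum[Pn] = 0;
        for (int i = 1000*Pn; i < 1000*(Pn+1); i += 1)
            sum[Pn] += A[i];

        /* Step 3: tree reduction; the barrier plays the role of synch() */
        int half = P;
        do {
            #pragma omp barrier
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half-1];   /* pick up the odd element */
            half = half / 2;             /* dividing line on who sums */
            if (Pn < half) sum[Pn] += sum[Pn+half];
        } while (half > 1);
    }
    printf("total = %f\n", sum[0]);
    return 0;
}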

Chapter 6 — Parallel Processors from Client to Cloud — 32


§6.6 Introduction to Graphics Processing Units
History of GPUs
 A GPU (Graphics Processing Unit) is a specialized processor designed to
quickly perform complex calculations needed to create and display images,
especially for 3D graphics, video, and gaming.
3D Graphics Processing
 Initially used in expensive, high-end computers (e.g., SGI workstations).
 Moore’s Law (more power, lower cost over time) made 3D graphics
hardware cheaper and more accessible.
 As a result, 3D graphics cards became common in personal computers and
game consoles.
Graphics Processing Units (GPUs)
 Special processors built specifically for handling 3D graphics.
 They perform tasks like:
 Processing shapes (vertex)
 Coloring pixels
 Adding textures
 Converting images into pixels (rasterization)

Chapter 6 — Parallel Processors from Client to Cloud — 33


Graphics in the System
Key Points:
•GPU connects to the CPU via the PCI-Express
bus.
•GPU Memory is separate from system
memory.
•Different setups for Intel and AMD CPUs.
•Older systems used VGA controllers and
framebuffers for display output.
•Modern systems use dedicated GPUs for
faster graphics performance.
GPU Architectures
 Processing is highly data-parallel

GPUs are highly multithreaded

Use thread switching to hide memory latency

Less reliance on multi-level caches

Graphics memory is wide and high-bandwidth
 Trend toward general purpose GPUs

Heterogeneous CPU/GPU systems

CPU for sequential code, GPU for parallel code
 Programming languages/APIs

DirectX, OpenGL – For building graphics applications.

C for Graphics (Cg), High Level Shader Language (HLSL)
For writing shaders (small GPU programs).

Compute Unified Device Architecture (CUDA): allows
programmers to use GPUs for general-purpose tasks, not
just graphics.
Chapter 6 — Parallel Processors from Client to Cloud — 35
Example: NVIDIA Tesla
 Multiple SIMD processors, each as shown:

Chapter 6 — Parallel Processors from Client to Cloud — 36


Example: NVIDIA Tesla
 The SIMD Processor contains 16 SIMD lanes (shown as 16 vertical
columns in the diagram), each acting as a thread processor.
 Each lane has:
 Its own registers (1K × 32), and
 A load/store unit for memory access.
How SIMD Instructions Work
 Each SIMD instruction works on a 32-element thread (i.e., 32 parallel
threads).
 Since there are only 16 lanes, the processor runs the instruction in
two cycles to cover all 32 elements.
Register Usage
 The system has a total of 32K × 32-bit registers, shared across lanes.
 Each thread gets 64 registers for its context (variables, temporary
storage, etc.).
GPU Memory Structures
CUDA’s memory hierarchy
and execution model in simple
terms:
•Each CUDA thread has its
own private memory.
•Threads are grouped into
blocks, sharing local (shared)
memory.
•Blocks form grids, which
access global GPU memory.
•Multiple grids run in
sequence, with inter-grid
synchronization when needed.

Chapter 6 — Parallel Processors from Client to Cloud — 38


Classifying GPUs
 GPUs don’t fit perfectly into standard models like SIMD (Single Instruction,
Multiple Data) or MIMD (Multiple Instruction, Multiple Data).
 They can appear like MIMD because threads can run different code
(conditional execution),
but this may cause performance issues.
Need to write general purpose code with care

Parallelism Models              Static: Discovered at Compile Time    Dynamic: Discovered at Runtime
Instruction-Level Parallelism   VLIW (Very Long Instruction Word)     Superscalar
Data-Level Parallelism          SIMD or Vector                        Tesla Multiprocessor

Chapter 6 — Parallel Processors from Client to Cloud — 39


Putting GPUs into Perspective
Feature                                              Multicore with SIMD   GPU
SIMD processors                                      8 to 24               15 to 80
SIMD lanes/processor                                 2 to 4                8 to 16
Multithreading hardware support for SIMD threads     2 to 4                16 to 32
Typical ratio of single- to double-precision perf.   2:1                   2:1
Largest cache size                                   48 MB                 6 MB
Size of memory address                               64-bit                64-bit
Size of main memory                                  64 GB to 1024 GB      4 GB to 16 GB
Memory protection at level of page                   Yes                   Yes
Demand paging                                        Yes                   No
Cache coherent                                       Yes                   No

Chapter 6 — Parallel Processors from Client to Cloud — 40


Guide to GPU Terms

Chapter 6 — Parallel Processors from Client to Cloud — 41


PART 2

Chapter 6 — Parallel Processors from Client to Cloud — 42


§6.7 Clusters, WSC, and Other Message-Passing MPs
Message Passing
 Each processor has private physical
address space
 Hardware sends/receives messages
between processors

Chapter 6 — Parallel Processors from Client to Cloud — 44


Loosely Coupled Clusters
 Network of independent computers
 Each has private memory and OS
 Connected using I/O system

E.g., Ethernet/switch, Internet
 Suitable for applications with independent tasks
 Web servers, databases, simulations, …
 High availability, scalable, affordable
 Problems
 Administration cost (prefer virtual machines)
 Low interconnect bandwidth

c.f. processor/memory bandwidth on an SMP

Chapter 6 — Parallel Processors from Client to Cloud — 45


Sum Reduction (Again)
 Sum 64,000 on 64 processors
 First distribute 1000 numbers to each
 Then do partial sums
sum = 0;
for (i = 0; i < 1000; i += 1)
    sum += AN[i];
 Reduction
 Half the processors send, other half receive
and add
 Then a quarter send, a quarter receive and add, …
Chapter 6 — Parallel Processors from Client to Cloud — 46
Sum Reduction (Again)
 Given send() and receive() operations
limit = 64; half = 64;     /* 64 processors */
do
    half = (half+1)/2;     /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit/2))
        sum += receive();
    limit = half;          /* upper limit of senders */
while (half > 1);          /* exit with final sum */
 Send/receive also provide synchronization
 Assumes send/receive take similar time to addition
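A minimal runnable sketch of the same reduction with MPI, where MPI_Send/MPI_Recv stand in for the send()/receive() above (the stand-in partial sums and names are illustrative; run with, e.g., mpirun -np 64):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int Pn, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double sum = 1000.0 * Pn;          /* stand-in for this node's partial sum */

    int limit = nprocs, half = nprocs;
    do {
        half = (half + 1) / 2;         /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit / 2) {
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, Pn + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += partial;
        }
        limit = half;                  /* upper limit of senders */
    } while (half > 1);

    if (Pn == 0) printf("total = %f\n", sum);
    MPI_Finalize();
    return 0;
}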

Chapter 6 — Parallel Processors from Client to Cloud — 47


Grid Computing
 Separate computers interconnected by
long-haul networks
 E.g., Internet connections
 Work units farmed out, results sent back
 Can make use of idle time on PCs
 E.g., SETI@home, World Community Grid

Chapter 6 — Parallel Processors from Client to Cloud — 48


§6.9 Introduction to Multiprocessor Network Topologies
Interconnection Networks
 Network topologies
 Arrangements of processors, switches, and links

Bus Ring

N-cube (N = 3)
2D Mesh
Fully connected

Chapter 6 — Parallel Processors from Client to Cloud — 49


Multistage Networks

Chapter 6 — Parallel Processors from Client to Cloud — 50


Network Characteristics
 Performance
 Latency per message (unloaded network)
 Throughput

Link bandwidth

Total network bandwidth

Bisection bandwidth
 Congestion delays (depending on traffic)
 Cost
 Power
 Routability in silicon

Chapter 6 — Parallel Processors from Client to Cloud — 51


§6.11 Multiprocessor Benchmarks and Performance Models
Parallel Benchmarks
 Linpack: matrix linear algebra
 SPECrate: parallel run of SPEC CPU programs
 Job-level parallelism
 SPLASH: Stanford Parallel Applications for
Shared Memory
 Mix of kernels and applications, strong scaling
 NAS (NASA Advanced Supercomputing) suite
 computational fluid dynamics kernels
 PARSEC (Princeton Application Repository for
Shared Memory Computers) suite
 Multithreaded applications using Pthreads and
OpenMP
Chapter 6 — Parallel Processors from Client to Cloud — 52
Code or Applications?
 Traditional benchmarks
 Fixed code and data sets
 Parallel programming is evolving
 Should algorithms, programming languages,
and tools be part of the system?
 Compare systems, provided they implement a
given application
 E.g., Linpack, Berkeley Design Patterns
 Would foster innovation in approaches to
parallelism

Chapter 6 — Parallel Processors from Client to Cloud — 53


Modeling Performance
 Assume performance metric of interest is
achievable GFLOPs/sec
 Measured using computational kernels from
Berkeley Design Patterns
 Arithmetic intensity of a kernel
 FLOPs per byte of memory accessed
 For a given computer, determine
 Peak GFLOPS (from data sheet)
 Peak memory bytes/sec (using Stream
benchmark)

Chapter 6 — Parallel Processors from Client to Cloud — 54


Roofline Diagram

Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
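The same formula as a small C helper (a sketch; the parameter names are illustrative, with peak GFLOPS taken from the data sheet and peak memory bandwidth from the Stream benchmark):

double roofline(double peak_gflops, double peak_mem_bw_gbs, double intensity) {
    double memory_bound = peak_mem_bw_gbs * intensity;   /* GFLOP/s the memory system can feed */
    return memory_bound < peak_gflops ? memory_bound     /* under the slanted part of the roof */
                                      : peak_gflops;     /* under the flat part of the roof    */
}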

Chapter 6 — Parallel Processors from Client to Cloud — 55


Roofline Diagram

Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )

Chapter 6 — Parallel Processors from Client to Cloud — 56


Comparing Systems
 Example: Opteron X2 vs. Opteron X4
 2-core vs. 4-core, 2× FP performance/core, 2.2GHz
vs. 2.3GHz
 Same memory system

 To get higher performance


on X4 than X2
 Need high arithmetic intensity
 Or working set must fit in X4’s
2MB L-3 cache

Chapter 6 — Parallel Processors from Client to Cloud — 57


Optimizing Performance
 Optimize FP performance
 Balance adds & multiplies
 Improve superscalar ILP
and use of SIMD
instructions
 Optimize memory usage
 Software prefetch

Avoid load stalls
 Memory affinity

Avoid non-local data
accesses

Chapter 6 — Parallel Processors from Client to Cloud — 58


Optimizing Performance
 Choice of optimization depends on
arithmetic intensity of code
 Arithmetic intensity is
not always fixed
 May scale with
problem size
 Caching reduces
memory accesses

Increases arithmetic
intensity
Chapter 6 — Parallel Processors from Client to Cloud — 59
TPUv3 vs Volta for DNN

Chapter 6 — Parallel Processors from Client to Cloud — 60


TPUv3
 Core Sequencer:
 VLIW with software-managed memory

322-bit VLIW w/8 operations:
 2 x scalar ALU, 2 x vector ALU, vector load and store, 2 x
queue operations for matrix multiply/transpose unit

Chapter 6 — Parallel Processors from Client to Cloud — 61


TPUv3
 Vector Processing Unit (VPU)
 Uses data-level parallelism (2D matrix and vector functional
units) and instruction-level parallelism (8 operations per
instruction)
 Uses on-chip vector memory (Vmem) with 32K 128 x 32-bit
elements (16 MiB)
 32 2D vector registers (Vregs) that each contain 128 x 8 32-bit
elements (4 KiB)
 MXU
 Produces 32-bit FP products from 16-bit FP inputs that
accumulate in 32 bits
 Two MXUs per TensorCore

 The Transpose Reduction Permute Unit


 128x128 matrix transposes, reductions, and permutations

Chapter 6 — Parallel Processors from Client to Cloud — 62


TPUv3 vs Volta for DNN

Chapter 6 — Parallel Processors from Client to Cloud — 63


Speedup of TPUv3 vs Volta

Chapter 6 — Parallel Processors from Client to Cloud — 64


TPUv3 and Volta Scalability

Chapter 6 — Parallel Processors from Client to Cloud — 65


§6.13 Going Faster: Multiple Processors and Matrix Multiply
Multi-threading DGEMM
 Use OpenMP:
#include <x86intrin.h>
#define UNROLL (4)
#define BLOCKSIZE 32

void do_block (int n, int si, int sj, int sk,
               double *A, double *B, double *C)
{
  for ( int i = si; i < si+BLOCKSIZE; i += UNROLL*8 )
    for ( int j = sj; j < sj+BLOCKSIZE; j++ ) {
      __m512d c[UNROLL];
      for ( int r = 0; r < UNROLL; r++ )
        c[r] = _mm512_load_pd(C+i+r*8+j*n);

      for ( int k = sk; k < sk+BLOCKSIZE; k++ )
      {
        __m512d bb = _mm512_broadcastsd_pd(_mm_load_sd(B+j*n+k));
        for ( int r = 0; r < UNROLL; r++ )
          c[r] = _mm512_fmadd_pd(_mm512_load_pd(A+n*k+r*8+i), bb, c[r]);
      }

      for ( int r = 0; r < UNROLL; r++ )
        _mm512_store_pd(C+i+r*8+j*n, c[r]);
    }
}

void dgemm (int n, double* A, double* B, double* C)
{
#pragma omp parallel for
  for ( int sj = 0; sj < n; sj += BLOCKSIZE )
    for ( int si = 0; si < n; si += BLOCKSIZE )
      for ( int sk = 0; sk < n; sk += BLOCKSIZE )
        do_block(n, si, sj, sk, A, B, C);
}
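A hypothetical driver for the code above (an assumption, not from the text): n must be a multiple of BLOCKSIZE (and of UNROLL×8), the matrices are stored contiguously as in the text, and C must start zeroed because dgemm accumulates into it. Compile with, e.g., gcc -O3 -mavx512f -fopenmp.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 512;
    double *A = aligned_alloc(64, n*n*sizeof(double));   /* 64-byte alignment for _mm512 loads */
    double *B = aligned_alloc(64, n*n*sizeof(double));
    double *C = aligned_alloc(64, n*n*sizeof(double));
    for (int i = 0; i < n*n; i++) { A[i] = 1.0; B[i] = 1.0; C[i] = 0.0; }

    dgemm(n, A, B, C);

    printf("C[0] = %f (expect %d)\n", C[0], n);  /* each entry sums n products of 1.0 */
    free(A); free(B); free(C);
    return 0;
}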

Chapter 6 — Parallel Processors from Client to Cloud — 66


Multithreaded DGEMM

Chapter 6 — Parallel Processors from Client to Cloud — 67


Multithreaded DGEMM

Chapter 6 — Parallel Processors from Client to Cloud — 68


§6.14 Fallacies and Pitfalls
Fallacies
 Amdahl’s Law doesn’t apply to parallel
computers
 Since we can achieve linear speedup
 But only on applications with weak scaling
 Peak performance tracks observed
performance
 Marketers like this approach!
 But compare Xeon with others in example
 Need to be aware of bottlenecks

Chapter 6 — Parallel Processors from Client to Cloud — 69


Fallacies
 Not developing the software to take
advantage of, or optimize for, a novel
architecture
 Unexpected bottlenecks, e.g. serialization of
page tables
 Usability for DSAs
 You can get good vector performance
without providing memory bandwidth
 Beware of the sloping part of the roofline

Chapter 6 — Parallel Processors from Client to Cloud — 70


Pitfalls
 Not developing the software to take
account of a multiprocessor architecture
 Example: using a single lock for a shared
composite resource

Serializes accesses, even if they could be done in
parallel

Use finer-granularity locking
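A minimal sketch of the finer-granularity idea with POSIX mutexes: one lock per bucket of a shared table instead of a single lock for the whole table, so updates to different buckets proceed in parallel (all names here are illustrative):

#include <pthread.h>

#define NBUCKETS 16
static int count[NBUCKETS];
static pthread_mutex_t lock[NBUCKETS];

void table_init(void) {
    for (int b = 0; b < NBUCKETS; b++)
        pthread_mutex_init(&lock[b], NULL);   /* one lock per bucket */
}

void increment(int key) {
    int b = key % NBUCKETS;                   /* take only this bucket's lock */
    pthread_mutex_lock(&lock[b]);
    count[b]++;
    pthread_mutex_unlock(&lock[b]);
}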

Chapter 6 — Parallel Processors from Client to Cloud — 71


Pitfalls
 Assuming the ISA completely hides the
physical implementation properties
 Attacker can examine state changes caused
by instructions that are rolled back or
performance differences caused by
intermixing of instructions from different
programs on the same server

Speculation

Caching

Hardware multithreading

Chapter 6 — Parallel Processors from Client to Cloud — 72


§6.15 Concluding Remarks
Concluding Remarks
 Goal: higher performance by using multiple
processors
 Difficulties
 Developing parallel software
 Devising appropriate architectures
 SaaS importance is growing and clusters are a
good match
 Performance per dollar and performance per
Joule drive both mobile and WSC

Chapter 6 — Parallel Processors from Client to Cloud — 73


 Concluding Remarks (cont.)
 SIMD and vector
operations match
multimedia applications
and are easy to
program

Chapter 6 — Parallel Processors from Client to Cloud — 74
