6th Edition
Chapter 6: Parallel Processors from Client to Cloud
PART 1
Conclusion:
As matrix size increases, the parallelizable work dominates the total.
This means the serial part (10 scalar additions) becomes less significant.
So the speedup gets closer to ideal as the problem size grows; this is
known as better scalability.
Key insight: Large workloads benefit more from parallelism.
Strong vs Weak Scaling
Strong Scaling:
Fixed problem size, increase processors
Goal: reduce time as you add processors
Example: Original 10 scalars + matrix work
Weak Scaling:
Problem size grows with number of processors
Goal: keep time constant
Example:
10 processors, 10×10 matrix (100 elements) → Time = (10 + 100/10) × t_add = 20 × t_add
100 processors, 32×32 matrix (≈ 1000 elements) → Time = (10 + 1000/100) × t_add = 20 × t_add
Conclusion:
In weak scaling, if the load is balanced well, performance stays
constant as we scale up both the problem and the processors.
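A minimal C sketch of the time model behind these numbers (in units of t_add, assuming perfect load balance; the function name time_model is ours, for illustration only):

#include <stdio.h>

/* time in units of t_add: 10 serial scalar adds plus n matrix-element
   adds spread evenly over p processors (perfect load balance assumed) */
double time_model(double n, double p) {
    return 10.0 + n / p;
}

int main(void) {
    printf("strong, p=10:  %g\n", time_model(100, 10));    /* 20 */
    printf("strong, p=100: %g\n", time_model(100, 100));   /* 11 */
    printf("weak,   p=100: %g\n", time_model(1000, 100));  /* 20 */
    return 0;
}

Strong scaling shrinks the time (20 → 11) but the 10 serial adds cap the gain; weak scaling holds the time at 20 as both grow.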
Example:
A regular (scalar) processor adds two numbers at a time:
5 + 3 = 8
A vector processor adds two whole lists (vectors) of numbers in a
single instruction, e.g., [1, 2, 3] + [4, 5, 6] = [5, 7, 9].
lv, sv: load/store vector
addv.d: add vectors of double
addvs.d: add scalar to each element of vector of double
Significantly reduces instruction-fetch bandwidth: fewer
instructions are needed, since one vector instruction can handle many data
items at once, saving time and memory.
Example: DAXPY (Y = a × X + Y)
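In C, the whole computation is one loop (a sketch; the 64-element count matches the 512-byte upper bound in the MIPS code below, since 512/8 = 64 doubles):

void daxpy(double a, double x[64], double y[64]) {
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];   /* Y = a*X + Y, one element per iteration */
}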
Conventional MIPS code
l.d $f0,a($sp) ;load scalar a
addiu $t1,$s0,#512 ;upper bound of what to load (512 bytes = 64 doubles)
loop: l.d $f2,0($s0) ;load x(i)
mul.d $f2,$f2,$f0 ;a × x(i)
l.d $f4,0($s1) ;load y(i)
add.d $f4,$f4,$f2 ;a × x(i) + y(i)
s.d $f4,0($s1) ;store into y(i)
addiu $s0,$s0,#8 ;increment index to x
addiu $s1,$s1,#8 ;increment index to y
subu $t0,$t1,$s0 ;compute bound
bne $t0,$zero,loop ;check if done
Vector MIPS code
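A sketch of the vector version, in the textbook's vector MIPS style (mulvs.d, multiply vector by scalar, is assumed here by analogy with the addvs.d listed earlier):

l.d $f0,a($sp) ;load scalar a
lv $v1,0($s0) ;load vector x
mulvs.d $v2,$v1,$f0 ;vector-scalar multiply
lv $v3,0($s1) ;load vector y
addv.d $v4,$v2,$v3 ;add y to product
sv $v4,0($s1) ;store the result

For 64 elements, these 6 instructions replace the nearly 600 executed by the scalar loop (9 per iteration × 64 iterations), which is the instruction-fetch saving noted earlier.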
Benefits of SIMD:
Faster performance (less time to process big data sets)
Less power and hardware usage compared to doing everything one by one
SIMD – Cont.
Key Points:
Operate on vectors (lists of data), not just one number at a time
Example: MMX and SSE let CPUs work on multiple values at once using
wide registers: 64 bits for MMX, 128 bits for SSE (like doing math on four
32-bit numbers in one step).
All processors run the same instruction at the same time,
but each may work on different pieces of data (e.g., different memory
locations).
Easier to synchronize
Since everyone is doing the same thing, it's simpler to keep everything in
sync.
Less hardware control is needed
Only one instruction is sent to all processors — this simplifies the system.
Best for data-parallel tasks
Works really well when doing the same operation on lots of data (see the sketch after this list), like in:
Image processing
Audio/video
Scientific simulations
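For instance, image processing is data-parallel in exactly this sense. A hypothetical brighten routine in C (the name and parameters are ours, for illustration); because every pixel gets the same operation, SIMD hardware or an auto-vectorizing compiler can process several pixels per instruction:

void brighten(unsigned char *pix, int n, int delta) {
    for (int i = 0; i < n; i++) {
        int v = pix[i] + delta;                       /* same op on every pixel */
        pix[i] = (v > 255) ? 255 : (unsigned char)v;  /* clamp to 8 bits */
    }
}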
SIMD – Cont.
MMX (MultiMedia eXtensions)
Introduced by Intel in the 1990s.
It added special instructions to the CPU to speed up multimedia tasks, like:
Image processing
Video playback
Audio compression
What it does:
It allows the CPU to process multiple pieces of data at once (SIMD = Single
Instruction, Multiple Data). For example, it can apply the same math operation to 4
pixels at the same time.
SSE (Streaming SIMD Extensions)
Also developed by Intel, as an improvement over MMX.
SSE is faster and more flexible.
It supports floating-point numbers, which MMX did not (important for things like 3D
graphics).
Multimedia extensions like MMX/SSE always use a fixed register size, such
as 128 bits, while a vector architecture can treat its registers as holding
2, 4, 8, or 16 elements, depending on element width.
This makes vector architectures more flexible for different data sizes.
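To make this concrete, a minimal C sketch using Intel's SSE intrinsics (_mm_set_ps, _mm_add_ps, and _mm_storeu_ps are the real SSE API; the data values are arbitrary):

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    /* pack four single-precision floats into one 128-bit register each */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

    __m128 c = _mm_add_ps(a, b);   /* one instruction adds all four lanes */

    float out[4];
    _mm_storeu_ps(out, c);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
    return 0;
}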
Hardware Multithreading
The processor can choose instructions from different threads at any
time, depending on which threads are stalled.
Each thread has its own registers, but shares function units and caches.
half = 64;
do {
  synch();   /* barrier: wait for all processors to finish this round */
  if (half % 2 != 0 && Pn == 0)
    sum[0] += sum[half-1];
    /* Conditional sum needed when half is odd;
       Processor0 gets the missing element */
  half = half / 2;   /* dividing line on who sums */
  if (Pn < half)
    sum[Pn] += sum[Pn + half];
} while (half > 1);
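To see the reduction tree in action, a self-contained C sketch that simulates the per-processor steps serially (the synch() barrier becomes implicit because the inner loop plays every processor's step in order; the partial sums 1..64 are dummy data):

#include <stdio.h>

#define P 64   /* number of simulated processors, matching half = 64 above */

int main(void) {
    double sum[P];
    for (int Pn = 0; Pn < P; Pn++)
        sum[Pn] = Pn + 1;              /* dummy partial sums: 1, 2, ..., 64 */

    int half = P;
    do {
        /* the synch() barrier is implicit: each round finishes before the next */
        if (half % 2 != 0)
            sum[0] += sum[half - 1];   /* odd count: "processor 0" takes the stray element */
        half = half / 2;               /* dividing line on who sums */
        for (int Pn = 0; Pn < half; Pn++)
            sum[Pn] += sum[Pn + half]; /* each low-half "processor" adds its high-half partner */
    } while (half > 1);

    printf("total = %g (expected %g)\n", sum[0], (double)P * (P + 1) / 2);
    return 0;
}

Each round halves the number of active processors, so 64 partial sums collapse into sum[0] in log2(64) = 6 rounds.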
Network topologies (figure): Bus, Ring, 2D Mesh, N-cube (N = 3), Fully connected.
Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
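A minimal C sketch of the roofline formula; the peak numbers are hypothetical, chosen only for illustration:

#include <stdio.h>

static const double PEAK_MEM_BW  = 16.0;  /* GB/s, hypothetical machine */
static const double PEAK_FP_PERF = 16.0;  /* GFLOP/s, hypothetical machine */

/* roofline: attainable = min(memory-bound ceiling, compute ceiling) */
double attainable(double arith_intensity /* FLOPs per byte */) {
    double mem_bound = PEAK_MEM_BW * arith_intensity;
    return mem_bound < PEAK_FP_PERF ? mem_bound : PEAK_FP_PERF;
}

int main(void) {
    /* DAXPY does 2 FLOPs per element against 24 bytes moved (two 8-byte
       loads, one 8-byte store), so its arithmetic intensity is ~0.083:
       firmly under the memory roof on this machine */
    printf("DAXPY (AI ~0.083): %.2f GFLOP/s\n", attainable(2.0 / 24.0));
    printf("AI = 4.0:          %.2f GFLOP/s\n", attainable(4.0));
    return 0;
}

With these numbers DAXPY attains only about 1.3 GFLOP/s, while a kernel with arithmetic intensity 4.0 hits the 16 GFLOP/s compute ceiling instead.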