
Performance Metrics, Prediction, and Measurement

CPS343

Parallel and High Performance Computing

Spring 2016

Outline

1 Analyzing Parallel Programs
    Speedup and Efficiency
    Amdahl’s Law and Gustafson-Barsis’s Law
    Evaluating Parallel Algorithms
Acknowledgements

Some material used in creating these slides comes from

    Multicore and GPU Programming: An Integrated Approach, Gerassimos Barlas, Morgan Kaufmann/Elsevier, 2015.
    Introduction to High Performance Scientific Computing, Victor Eijkhout, 2015.
    Parallel Programming in C with MPI and OpenMP, Michael Quinn, McGraw-Hill, 2004.
Speedup and Efficiency
Speedup

Speedup is defined as

    speedup on N processors = (sequential execution time) / (execution time on N processors) = tseq / tpar

Generally we want to use execution times obtained using the best available algorithm. The best algorithm for a sequential program may be different from the best algorithm for a parallel program.
Efficiency

Efficiency is defined as

    efficiency = speedup / N = tseq / (N · tpar)

Given this definition, we expect

    0 ≤ efficiency ≤ 1
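To make these definitions concrete, here is a minimal OpenMP sketch (my own illustration, not from the course materials) that times the same loop sequentially and in parallel and reports speedup and efficiency; the harmonic-sum workload and problem size are arbitrary stand-ins. Compile with OpenMP enabled, e.g. gcc -O2 -fopenmp.

    /* A minimal sketch: measure tseq and tpar for the same reduction
       loop and report speedup and efficiency. */
    #include <omp.h>
    #include <stdio.h>

    #define M 100000000L

    int main(void)
    {
        int N = omp_get_max_threads();
        double sum1 = 0.0, sum2 = 0.0, t0;

        t0 = omp_get_wtime();                  /* sequential run: tseq */
        for (long i = 1; i <= M; i++) sum1 += 1.0 / i;
        double tseq = omp_get_wtime() - t0;

        t0 = omp_get_wtime();                  /* parallel run: tpar */
        #pragma omp parallel for reduction(+:sum2)
        for (long i = 1; i <= M; i++) sum2 += 1.0 / i;
        double tpar = omp_get_wtime() - t0;

        double speedup = tseq / tpar;          /* tseq / tpar */
        printf("N = %d  speedup = %.2f  efficiency = %.2f  (sums %.6f %.6f)\n",
               N, speedup, speedup / N, sum1, sum2);
        return 0;
    }

Printing both sums also keeps the compiler from optimizing the loops away.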
Speedup

Quinn uses ψ(n, N) for speedup to show it is a function of both problem size n and number of processors N.

    σ(n): time required for computation that is not parallelizable
    ϕ(n): time required for computation that is parallelizable
    κ(n, N): time for parallel overhead (communication, barriers, etc.)

Note that

    tseq = σ(n) + ϕ(n)
    tpar = σ(n) + ϕ(n)/N + κ(n, N)

Using these parameters, the speedup ψ(n, N) is given by

    ψ(n, N) = (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/N + κ(n, N))
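To experiment with this formula, the following sketch (my own illustration) evaluates ψ(n, N) for fixed component times; the values σ = 1, ϕ = 9, and the linear overhead model κ = 0.05N are hypothetical choices, not measurements from the course.

    /* A sketch evaluating psi(n,N) = (sigma + phi)/(sigma + phi/N + kappa).
       Component times and the overhead model kappa = 0.05*N are
       hypothetical, chosen only for illustration. */
    #include <stdio.h>

    double psi(double sigma, double phi, double kappa, int N)
    {
        return (sigma + phi) / (sigma + phi / N + kappa);
    }

    int main(void)
    {
        double sigma = 1.0, phi = 9.0;   /* serial and parallelizable parts */
        for (int N = 1; N <= 64; N *= 2)
            printf("N = %2d  psi = %5.2f\n", N, psi(sigma, phi, 0.05 * N, N));
        return 0;
    }

Even with this modest overhead model, the predicted speedup peaks and then degrades as N grows, because κ eventually dominates ϕ(n)/N.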
Speedup

Since κ(n, N) ≥ 0, if it is dropped (i.e., if we assume there is no parallel overhead), we obtain an upper bound on speedup:

    ψ(n, N) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/N)
Amdahl’s Law and Gustafson-Barsis’s Law
Amdahl’s Law

First appearing in a paper by Gene Amdahl in 1967, this law provides an upper bound on achievable speedup based on the fraction of the computation that can be done in parallel.

Suppose T is the time needed for an application to execute on a single CPU. Suppose also that α is the fraction of the computation that can be done in parallel, so that 1 − α is the fraction that must be carried out on a single CPU. Then, ignoring parallel overhead, we have

    ψ(N) = tseq / tpar ≤ T / ((1 − α)T + αT/N)
Amdahl’s Law

Simplifying, we have

    ψ(N) = tseq / tpar ≤ 1 / ((1 − α) + α/N)

This is Amdahl’s Law. Note that it says something rather discouraging: even as the number of processors increases without bound, speedup is bounded by

    ψ ≤ 1 / (1 − α).

For example, if 90% of the computation can be parallelized, so α = 0.9, the speedup cannot be larger than 1/0.1 = 10 regardless of the number of processors.
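A few lines of C make the bound tangible. This sketch (mine, not from the slides) tabulates 1/((1 − α) + α/N) for two illustrative values of α:

    /* A sketch tabulating the Amdahl bound 1/((1 - alpha) + alpha/N).
       The alpha values 0.90 and 0.99 are illustrative. */
    #include <stdio.h>

    double amdahl_bound(double alpha, int N)
    {
        return 1.0 / ((1.0 - alpha) + alpha / N);
    }

    int main(void)
    {
        for (int N = 2; N <= 1024; N *= 2)
            printf("N = %4d  alpha = 0.90: %6.2f   alpha = 0.99: %6.2f\n",
                   N, amdahl_bound(0.90, N), amdahl_bound(0.99, N));
        /* the first column approaches the limit 1/(1 - 0.9) = 10 as N grows */
        return 0;
    }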
Amdahl’s Law speedup prediction

Amdahl’s Law efficiency prediction
Amdahl’s Law example

Suppose a serial program reads n data from a file, performs some computation, and then writes n data back out to another file. The I/O time is measured and found to be 4500 + n µsec. If the computation portion takes n²/200 µsec, what is the maximum speedup we can expect when n = 10,000 and N processors are used?

We assume that the I/O must be done serially but that the computation can be parallelized. Computing α we find

    α = 500000 / (4500 + 10000 + 500000) = 5000/5145 ≈ 0.97182

so, by Amdahl’s Law,

    ψ ≤ 1 / ((1 − 5000/5145) + (5000/5145)/N) = 5145 / (145 + 5000/N)

This gives a maximum speedup of 6.68 on 8 processors and 11.25 on 16 processors.
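The arithmetic is easy to check numerically with a throwaway sketch:

    /* A sketch evaluating the example's bound psi <= 5145/(145 + 5000/N). */
    #include <stdio.h>

    int main(void)
    {
        int procs[] = { 1, 2, 4, 8, 16 };
        for (int i = 0; i < 5; i++) {
            int N = procs[i];
            printf("N = %2d  bound = %.2f\n", N, 5145.0 / (145.0 + 5000.0 / N));
        }
        return 0;
    }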
The Gustafson-Barsis Law

Amdahl’s Law assumes the mindset:

    “We have a sequential program and want to figure out what speedup is attainable by parallelizing as much of it as possible.”

It turns out that focusing on parallelizing a sequential program is not a good way to think about this, because parallel implementations of solutions often look very different from the serial solution.

A more useful way of thinking starts with a parallel program and asks:

    “We have a parallel program and want to figure out how much faster it is than a sequential program doing the same work.”
The Gustafson-Barsis Law

Amdahl’s Law focuses on speedup as a function of increasing the number of processors; i.e., “how much faster can we get a fixed amount of work done using N processors?” Sometimes the question is “how much more work can we get done in a fixed amount of time using N processors?”

Let T be the total time a parallel program requires when using N processors. As before, let 0 ≤ α ≤ 1 be the fraction of execution time the program spends executing code in parallel. Then

    tseq = (1 − α)T + α · T · N

so

    ψ ≤ tseq / tpar = ((1 − α)T + α · T · N) / T = (1 − α) + αN

This is the Gustafson-Barsis Law (1988). The speedup estimate it produces is sometimes called scaled speedup.
Gustafson-Barsis Law speedup prediction

This is much more encouraging than what Amdahl’s Law showed us.
Gustafson-Barsis Law efficiency prediction

Again, this is much more encouraging!
Gustafson-Barsis’s Law example

A parallel program takes 134 seconds to run on 32 processors. The total time spent in the sequential part of the program was 12 seconds. What is the scaled speedup?

Here α = (134 − 12)/134 = 122/134, so the scaled speedup is

    (1 − α) + αN = (1 − 122/134) + (122/134) · 32 = 29.224

This means that the program is running approximately 29 times faster than it would run on one processor... assuming it could run on one processor.
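As a quick numerical check, this sketch recomputes the example from the measured total time, serial time, and processor count:

    /* A sketch checking the scaled-speedup computation from the example:
       T = 134 s total on N = 32 processors, 12 s of it serial. */
    #include <stdio.h>

    int main(void)
    {
        double T = 134.0, serial = 12.0;
        int N = 32;
        double alpha = (T - serial) / T;         /* fraction spent in parallel code */
        double psi = (1.0 - alpha) + alpha * N;  /* Gustafson-Barsis scaled speedup */
        printf("alpha = %.4f, scaled speedup = %.3f\n", alpha, psi);  /* 29.224 */
        return 0;
    }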
The laws compared

The Wikipedia page for Gustafson’s Law offers the following metaphor to contrast the two laws.

Amdahl’s Law approximately suggests: Suppose a car is traveling between two cities 60 miles apart, and has already spent one hour traveling half the distance at 30 mph. No matter how fast you drive the last half, it is impossible to achieve a 90 mph average before reaching the second city. Since it has already taken you 1 hour and the total distance is only 60 miles, even driving infinitely fast you would only achieve a 60 mph average.

Gustafson-Barsis’s Law approximately suggests: Suppose a car has already been traveling for some time at less than 90 mph. Given enough time and distance to travel, the car’s average speed can always eventually reach 90 mph, no matter how long or how slowly it has already traveled. For example, if the car spent one hour at 30 mph, it could achieve this by driving at 120 mph for two additional hours, or at 150 mph for an hour.
Evaluating Parallel Algorithms
Speedup revisited

The metrics above ignore communication and other parallel overhead. When evaluating an approach to parallelizing a task, we should next include estimates of communication costs, which can dominate parallel program time.

Let ts be the time for a sequential version of the program and tp be the time for the parallel algorithm on p processors. Then speedup is ts/tp.
Parallel Execution Time

Parallel execution time tp can be broken down into two parts, computation time tcomp and communication time tcomm:

    tp = tcomp + tcomm

Speedup is then

    ψ = ts / tp = ts / (tcomp + tcomm)

The computation/communication ratio is tcomp / tcomm.
Message transfer time

Typically the time for communication can be broken down into two parts: the time tstartup necessary for building the message and initiating the transfer, and the time tdata required per data item in the message. To a first approximation this looks like

    t = tstartup + m · tdata

where m is the number of data items sent.
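As a quick illustration of the model, this sketch (mine) evaluates t for a few message lengths, using the constants measured in the experiment described next (tstartup = 88.25 µsec, tdata = 0.0415 µsec per integer):

    /* A sketch evaluating t = tstartup + m * tdata with the constants
       measured in the Fall 2010 experiment below. */
    #include <stdio.h>

    int main(void)
    {
        const double tstartup = 88.25;   /* microseconds */
        const double tdata    = 0.0415;  /* microseconds per integer */
        int sizes[] = { 100, 1000, 10000 };
        for (int i = 0; i < 3; i++)
            printf("m = %5d ints  t = %8.2f usec\n",
                   sizes[i], tstartup + sizes[i] * tdata);
        return 0;
    }

Note how startup time dominates short messages, while the per-item term dominates long ones.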
A communication timing experiment (Fall 2010)

The values of tstartup and tdata can be determined empirically:

    A test program sends messages ranging in length from 100 to 10000 integers between two nodes.
    Each message is sent (and received) and then sent back (and received).
    This is repeated 100 times.
    Linear regression is used to fit a line to the timing data.
    The fit gives tstartup = 88.25 µsec and tdata = 0.0415 µsec per integer, which corresponds to 96.43 MB/s (assuming 4 bytes per integer).
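The original test program is not reproduced here, but a minimal MPI ping-pong along these lines might look like the sketch below; the message lengths, repetition count, and output format are my assumptions, and extracting tstartup and tdata from the printed (m, t) pairs still requires a least-squares fit done separately.

    /* A minimal MPI ping-pong sketch for estimating t_startup and t_data.
       Run with exactly 2 ranks, e.g.: mpirun -np 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define REPS 100          /* round trips per message length */
    #define MAX_LEN 10000     /* longest message, in ints */

    int main(int argc, char *argv[])
    {
        int rank;
        int *buf = malloc(MAX_LEN * sizeof(int));

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int m = 100; m <= MAX_LEN; m += 100) {
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int r = 0; r < REPS; r++) {
                if (rank == 0) {
                    MPI_Send(buf, m, MPI_INT, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, m, MPI_INT, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, m, MPI_INT, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, m, MPI_INT, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t1 = MPI_Wtime();
            if (rank == 0) {
                /* one-way time per message, in microseconds */
                double t = (t1 - t0) / (2.0 * REPS) * 1.0e6;
                printf("%6d %12.3f\n", m, t);  /* fit t = tstartup + m*tdata */
            }
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

Each iteration measures a full round trip, so the one-way estimate divides the elapsed time by 2 · REPS; the intercept of the fitted line is tstartup and its slope is tdata.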
Workstation cluster timing data (Fall 2010)

Workstation cluster timing data (Spring 2016)

Minor Prophets cluster latency and bandwidth (2016)

Canaan cluster latency and bandwidth (2016)

LittleFe cluster latency and bandwidth (2013)
Cluster latency and bandwidth comparison (panels: Minor Prophets, Canaan)

Error bars show 1 standard deviation in the data values averaged to produce the plots.
