
CSCI 8150

Advanced Computer Architecture


Hwang, Chapter 3
Principles of Scalable Performance
3.1 Performance Metrics and Measures
Degree of Parallelism
The number of processors used at any instant to
execute a program is called the degree of
parallelism (DOP); this can vary over time.
The maximum DOP assumes an unbounded number of
processors is available; this is not achievable in real
machines, so program segments whose DOP exceeds the
number of physical processors must be executed
sequentially, as a series of smaller parallel segments.
Other resources (memory, I/O) may impose additional
limiting conditions.
A plot of DOP vs. time is called a parallelism profile.

Example Parallelism Profile

[Figure: DOP plotted against time over the observation interval (t1, t2), with the average parallelism shown as a horizontal line through the profile.]
Average Parallelism - 1
Assume the following:
n homogeneous processors
maximum parallelism in a profile is m
Ideally, n >> m
Δ, the computing capacity of a single processor, is
something like MIPS or Mflops, without regard for memory latency, etc.
i is the number of processors busy in an observation
period (i.e. DOP = i)
W is the total work (instructions or computations)
performed by a program
A is the average parallelism in the program
Average Parallelism - 2
W = \Delta \int_{t_1}^{t_2} DOP(t)\,dt

W = \Delta \sum_{i=1}^{m} i \, t_i

where t_i = total time that DOP = i, and \sum_{i=1}^{m} t_i = t_2 - t_1.
Average Parallelism - 3
A = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} DOP(t)\,dt

Equivalently, in terms of the discrete profile:

A = \left( \sum_{i=1}^{m} i \, t_i \right) \Big/ \left( \sum_{i=1}^{m} t_i \right)
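To make the discrete form concrete, here is a minimal Python sketch of the average-parallelism calculation; the (i, t_i) profile values are hypothetical example numbers.

# Minimal sketch: average parallelism from a discretized parallelism profile.
# The (DOP, duration) pairs are hypothetical example values.
profile = [(1, 4.0), (2, 3.0), (4, 2.0), (8, 1.0)]  # (i, t_i)

work = sum(i * t for i, t in profile)     # proportional to W (Delta cancels)
elapsed = sum(t for _, t in profile)      # t2 - t1
A = work / elapsed                        # average parallelism

print(f"A = {A:.3f}")  # (1*4 + 2*3 + 4*2 + 8*1) / 10 = 2.6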
Available Parallelism
Various studies have shown that the potential
parallelism in scientific and engineering
calculations can be very high (e.g. hundreds or
thousands of instructions per clock cycle).
But in real machines, the actual parallelism is much
smaller (e.g. 10 or 20).
Basic Blocks
A basic block is a sequence or block of instructions
with one entry and one exit.
Basic blocks are frequently used as the focus of
optimizers in compilers (since it is easier to manage
the registers used within the block).
Limiting optimization to basic blocks limits the
instruction level parallelism that can be obtained
(to about 2 to 5 in typical code).
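For instance, in the hypothetical Python function below, each straight-line run of statements between branch points is a basic block; an optimizer restricted to basic blocks can only exploit the parallelism inside each marked region.

# Hypothetical snippet illustrating basic blocks; boundaries are marked
# in the comments. Control enters each block only at the top and leaves
# only at the bottom.
def saxpy_step(a, x, y):
    t = a * x       # block 1: straight-line code with
    t = t + y       # one entry and one exit
    if t < 0.0:     # the conditional branch ends block 1
        t = 0.0     # block 2 (taken path)
    return t        # block 3 (join point)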
Asymptotic Speedup - 1

W_i = \Delta \, i \, t_i    (work done when DOP = i)

W = \sum_{i=1}^{m} W_i    (relates the sum of the W_i terms to W)

t_i(k) = W_i / (k \Delta)    (execution time of W_i with k processors)

t_i(1) = W_i / \Delta    (execution time of W_i with 1 processor)

t_i(\infty) = W_i / (i \Delta)    (for 1 ≤ i ≤ m)
Asymptotic Speedup - 2

T(1) = \sum_{i=1}^{m} t_i(1) = \sum_{i=1}^{m} W_i / \Delta    (response time with 1 processor)

T(\infty) = \sum_{i=1}^{m} t_i(\infty) = \sum_{i=1}^{m} W_i / (i \Delta)    (response time with ∞ processors)

S_\infty = T(1) / T(\infty) = \left( \sum_{i=1}^{m} W_i \right) \Big/ \left( \sum_{i=1}^{m} W_i / i \right) = A    (in the ideal case)

S_\infty is the asymptotic speedup.
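The following minimal Python sketch evaluates the asymptotic speedup for a hypothetical work distribution W_i; Δ cancels in the ratio, so it is omitted.

# Minimal sketch: asymptotic speedup from a hypothetical work distribution.
# W[i-1] is the work W_i performed while DOP = i (arbitrary work units).
W = [10.0, 0.0, 6.0, 24.0]  # W_1 .. W_4, so m = 4

T1 = sum(W)                                            # T(1)
Tinf = sum(Wi / i for i, Wi in enumerate(W, start=1))  # T(infinity)
S = T1 / Tinf                                          # equals A in the ideal case

print(f"S_inf = {S:.3f}")  # 40 / (10 + 0 + 2 + 6) = 2.222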
Mean Performance Calculation
We seek to obtain a measure that characterizes
the mean, or average, performance of a set of
benchmark programs with potentially many
different execution modes (e.g. scalar, vector,
sequential, parallel).
We may also wish to associate weights with these
programs to emphasize these different modes and
yield a more meaningful performance measure.
Arithmetic Mean
The arithmetic mean is familiar (sum of the terms
divided by the number of terms).
Our measures will use execution rates expressed in
MIPS or Mflops.
The arithmetic mean of a set of execution rates is
proportional to the sum of the inverses of the
execution times; it is not inversely proportional to
the sum of the execution times.
Thus arithmetic mean fails to represent real times
consumed by the benchmarks when executed.
Geometric Mean
A geometric mean of n terms is the nth root of the
product of the n terms.
Like the arithmetic mean, the geometric mean of a
set of execution rates does not have an inverse
relationship with the total execution time of the
programs.
(Geometric mean has been advocated for use with
normalized performance numbers for comparison
with a reference machine.)
Harmonic Mean
Instead of using the arithmetic or geometric mean, we
use the harmonic mean execution rate, which is just
the inverse of the arithmetic mean of the execution
times (thus guaranteeing the inverse relation not
exhibited by the other means):

R_h = m \Big/ \left( \sum_{i=1}^{m} 1/R_i \right)
Weighted Harmonic Mean
If we associate weights f_i with the benchmarks
(with \sum_{i=1}^{m} f_i = 1), then we can compute the
weighted harmonic mean:

R_h^* = 1 \Big/ \left( \sum_{i=1}^{m} f_i / R_i \right)
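A small numeric sketch (with hypothetical execution rates) shows why only the harmonic mean agrees with the rate actually observed over the whole run:

# Two hypothetical benchmarks, each executing 1 unit of work, with
# measured rates of 100 and 10 MIPS.
rates = [100.0, 10.0]

arithmetic = sum(rates) / len(rates)                 # 55.0
geometric = (rates[0] * rates[1]) ** 0.5             # ~31.6
harmonic = len(rates) / sum(1.0 / r for r in rates)  # ~18.2

total_time = sum(1.0 / r for r in rates)             # 0.01 + 0.1 = 0.11
true_rate = len(rates) / total_time                  # total work / total time

print(arithmetic, geometric, harmonic, true_rate)
# Only the harmonic mean equals the true aggregate rate (~18.2 MIPS).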
Weighted Harmonic Mean Speedup
T_1 = 1/R_1 = 1 is the sequential execution time on a single
processor with rate R_1 = 1.
T_i = 1/R_i = 1/i is the execution time using i processors
with a combined execution rate of R_i = i.
Now suppose a program has n execution modes with
associated weights f_1, …, f_n. The weighted harmonic mean
speedup is defined as:

S = T_1 / T^* = 1 \Big/ \left( \sum_{i=1}^{n} f_i / R_i \right)

where T^* = 1/R_h^* is the weighted arithmetic mean execution time.
Amdahl's Law

Assume R_i = i, and that the weights are (α, 0, …, 0, 1−α).
Basically this means the system is used either sequentially (with
probability α) or with all n processors (with probability 1−α).
This yields the speedup equation known as Amdahl's law:

S_n = n / (1 + (n − 1) α)

The implication is that the best speedup possible is 1/α,
regardless of n, the number of processors.
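A minimal Python sketch of Amdahl's law, assuming a hypothetical sequential fraction α = 0.1:

# Minimal sketch: Amdahl's law with a hypothetical sequential fraction.
def amdahl_speedup(n, alpha):
    return n / (1 + (n - 1) * alpha)  # S_n = n / (1 + (n-1) * alpha)

alpha = 0.1  # 10% of the work is sequential (hypothetical)
for n in (1, 4, 16, 64, 1024):
    print(n, round(amdahl_speedup(n, alpha), 2))
# Output approaches 1/alpha = 10 no matter how large n grows.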
System Efficiency 1
Assume the following definitions:
O (n) = total number of unit operations performed by an n-
processor system in completing a program P.
T (n) = execution time required to execute the program P on an n-
processor system.
O (n) can be considered similar to the total number of
instructions executed by the n processors, perhaps scaled
by a constant factor.
If we define O (1) = T (1), then it is logical to expect that
T (n) < O (n) when n > 1 if the program P is able to make
any use at all of the extra processor(s).
System Efficiency 2
Clearly, the speedup factor (how much faster the program
runs with n processors) can now be expressed as
S (n) = T (1) / T (n)
Recall that we expect T (n) < T (1), so S (n) > 1.
System efficiency is defined as
E (n) = S (n) / n = T (1) / ( n T (n) )
It indicates the actual degree of speedup achieved in a
system as compared with the maximum possible speedup.
Thus 1/n ≤ E (n) ≤ 1. The value is 1/n when only one
processor is used (regardless of n), and the value is 1 when
all processors are fully utilized.
Redundancy
The redundancy in a parallel computation is defined as
R (n) = O (n) / O (1)
What values can R (n) obtain?
R (n) = 1 when O (n) = O (1), or when the number of operations
performed is independent of the number of processors, n. This is
the ideal case.
R (n) = n when each processor performs the same number of
operations as a single processor would when used alone; this implies
that n completely redundant computations are performed!
The R (n) figure indicates to what extent the software
parallelism is carried over to the hardware implementation
without having extra operations performed.
System Utilization
System utilization is defined as
U (n) = R (n) E (n) = O (n) / ( n T (n) )
It indicates the degree to which the system
resources were kept busy during execution of the
program. Since 1 ≤ R (n) ≤ n and 1/n ≤ E (n) ≤ 1,
the best possible value for U (n) is 1, and the
worst is 1/n.
1/n ≤ E (n) ≤ U (n) ≤ 1
1 ≤ R (n) ≤ 1/E (n) ≤ n
Quality of Parallelism
The quality of a parallel computation is defined as
Q (n) = S (n) E (n) / R (n) = T³(1) / ( n T²(n) O (n) )
This measure is directly related to speedup (S) and
efficiency (E), and inversely related to redundancy
(R).
The quality measure is bounded by the speedup
(that is, Q (n) ≤ S (n) ).
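The Python sketch below ties speedup, efficiency, redundancy, utilization, and quality together; the measurements for T(1), T(n), and O(n) are hypothetical.

# Minimal sketch: the performance metrics above, computed from
# hypothetical measurements.
n = 4        # number of processors
T1 = 100.0   # T(1); we take O(1) = T(1) by definition
Tn = 40.0    # T(n), hypothetical
On = 120.0   # O(n), hypothetical

O1 = T1
S = T1 / Tn      # speedup     = 2.5
E = S / n        # efficiency  = 0.625
R = On / O1      # redundancy  = 1.2
U = R * E        # utilization = 0.75
Q = S * E / R    # quality     ~ 1.302 (= T1**3 / (n * Tn**2 * On))

assert Q <= S    # the quality measure is bounded by the speedup
print(S, E, R, U, Q)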
Standard Industry Performance Measures
MIPS and Mflops, while easily understood, are poor
measures of system performance, since their interpretation
depends on machine clock cycles and instruction sets. For
example, which of these machines is faster?
a 10 MIPS CISC computer
a 20 MIPS RISC computer
It is impossible to tell without knowing more details about
the instruction sets of the machines. Even the question
"which machine is faster?" is suspect, since we really need
to ask "faster at doing what?"
Doing What?
To answer the "doing what?" question, several standard
programs are frequently used.
The Dhrystone benchmark uses no floating point instructions,
system calls, or library functions. It uses exclusively integer data
items. Each execution of the entire set of high-level language
statements is a Dhrystone, and a machine is rated as having a
performance of some number of Dhrystones per second (sometimes
reported as KDhrystones/sec).
The Whetstone benchmark uses a more complex program involving
floating point and integer data, arrays, subroutines with
parameters, conditional branching, and library functions. It does
not, however, contain any obviously vectorizable code.
The performance of a machine on these benchmarks
depends in large measure on the compiler used to generate
the machine language. [Some companies have, in the
past, actually tweaked their compilers to specifically deal
with the benchmark programs!]
What's VAX Got To Do With It?
For many years, the Digital Equipment VAX-11/780
computer has been commonly agreed to be a 1-MIPS
machine (whatever that means).
Since the VAX-11/780 also has a rating of about
1.7 KDhrystones, this gives a method whereby a
relative MIPS rating for any other machine can be
derived: just run the Dhrystone benchmark on the
other machine, divide by 1.7K, and you then obtain
the relative MIPS rating for that machine
(sometimes also called VUPs, or VAX units of
performance).
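As a quick sketch of that calculation (the Dhrystone rating of the machine under test is hypothetical):

# Relative MIPS (VUPs) from a Dhrystone rating, per the rule above.
vax_dhrystones = 1700.0      # VAX-11/780: ~1.7 KDhrystones == 1 MIPS
machine_dhrystones = 8500.0  # hypothetical measurement on the other machine

vups = machine_dhrystones / vax_dhrystones
print(f"{vups:.1f} VUPs (relative MIPS)")  # 5.0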
Other Measures
Transactions per second (TPS) is a measure that is
appropriate for online systems like those used to support
ATMs, reservation systems, and point of sale terminals.
The measure may include communication overhead,
database search and update, and logging operations. The
benchmark is also useful for rating relational database
performance.
KLIPS is the measure of the number of logical inferences
per second that can be performed by a system, presumably
to relate how well that system will perform at certain AI
applications. Since one inference requires about 100
instructions (in the benchmark), a rating of 400 KLIPS is
roughly equivalent to 40 MIPS.