Chapter 1 Lecture 2 & 3 - Computer Performance

Improving the data cache is more effective for improving overall performance in this case.

Uploaded by

Isiyak Solomon
Copyright
© All Rights Reserved

Lecture Two and Three

Computer Performance Evaluation and Metrics

• Lecture objectives:
– Understand the ways in which computer performance is measured.
– Become familiar with the factors that contribute to improvements in CPU and disk performance.
• Goal: to understand the cost and performance implications of architectural choices.

Computer Performance Metrics

Response Time
Delay between the start and the end of a task

Throughput
Number of tasks completed per unit of time

Power/Energy
Energy per task, power

Metrics for different applications

Desktop computing
Metrics: performance (latency), cost

Server computing
Examples: web servers, transaction servers, file servers
Metrics: performance (throughput), reliability

Embedded computing
Examples: printer, cell phone, video console
Metrics: performance (real-time), cost, power consumption

Measures of system performance
• Measures of system performance depend upon one’s point of view.

– A computer user is most often concerned with response time: How long does it take the system to carry out a task?

– System administrators are usually more concerned with throughput: How many concurrent tasks can the system handle before response time is adversely affected?

Airline example

Plane        Speed (mph)   Range (miles)   Passengers   Time (hrs)   Throughput (passengers x mph)
Boeing 777       610           4630            375          6.5              228,750
Boeing 747       610           4150            470          6.5              286,700
Concorde        1350           4000            132          3.0              178,200
DC 8-50          544           8720            146          7.4               79,424

Performance depends on the specific interest: the Concorde is the fastest for a single trip, while the Boeing 747 has the highest passenger throughput.
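
The two views of "best" in the airline table can be checked with a short Python sketch. The rows are copied from the table; the names `PLANES` and `throughput` are only illustrative.

```python
# Rows copied from the airline table: (plane, speed in mph, passengers).
PLANES = [
    ("Boeing 777", 610, 375),
    ("Boeing 747", 610, 470),
    ("Concorde", 1350, 132),
    ("DC 8-50", 544, 146),
]


def throughput(speed_mph: int, passengers: int) -> int:
    """Passenger throughput as defined in the table: passengers x mph."""
    return speed_mph * passengers


# Best plane under each metric.
fastest = max(PLANES, key=lambda p: p[1])[0]
highest_throughput = max(PLANES, key=lambda p: throughput(p[1], p[2]))[0]

print(fastest)             # Concorde
print(highest_throughput)  # Boeing 747
```

Which plane "wins" changes with the metric, which is the same reason response time and throughput can rank two computers differently.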


How computer design increases performance
• Make the common case fast
The most important and general principle of computer design is to make the common case fast:
• In making a design trade-off, favor the frequent case over the infrequent case.
• This principle also applies when determining how to spend resources.
• The opportunity for improvement depends on how much time the event consumes.
– E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first.

Amdahl’s Law

• Amdahl’s Law tells us that the system performance gain realized from the speedup of one component depends not only on the speedup of the component itself, but also on the fraction of work done by the component.

Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

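Amdahl’s Law [contd…]

$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$

$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$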

• Amdahl’s Law can serve as a guide to how much an enhancement will improve performance.

• It is useful for comparing:
– the system performance of two alternatives
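
Amdahl’s Law is easy to turn into a small helper. The following is a minimal Python sketch, not part of the original slides; the function names `amdahl_speedup` and `amdahl_limit` are illustrative choices.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the original execution
    time is sped up by a factor of `speedup_enhanced`."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def amdahl_limit(fraction_enhanced: float) -> float:
    """Upper bound on the overall speedup as speedup_enhanced grows without
    bound (assumes fraction_enhanced < 1)."""
    return 1.0 / (1.0 - fraction_enhanced)


# Speeding up half of the work by 2x gives only about 1.33x overall,
# and no enhancement of that half can ever give more than 2x.
print(amdahl_speedup(0.5, 2.0))
print(amdahl_limit(0.5))  # 2.0
```

The limit function makes the main lesson explicit: the unenhanced fraction caps the overall gain, no matter how fast the enhanced part becomes.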
Example (1): Design comparison
• Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark.
– One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10.
– The other alternative is to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for a total of 50% of the execution time for the application.
– Compare these two alternatives.

We can compare these two alternatives by comparing the speedups.

• Improving the performance of the FP operations overall is slightly better because of the higher frequency.
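
The comparison can be reproduced numerically. This is a small Python sketch that applies Amdahl’s Law to the two proposals; the helper name `amdahl_speedup` is illustrative, and the fractions and factors are the ones given in the example.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup from Amdahl's Law."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Proposal 1: FPSQR is 20% of execution time and becomes 10x faster.
speedup_fpsqr = amdahl_speedup(0.20, 10.0)

# Proposal 2: all FP instructions are 50% of execution time and become 1.6x faster.
speedup_fp = amdahl_speedup(0.50, 1.6)

print(f"{speedup_fpsqr:.2f}")  # 1.22
print(f"{speedup_fp:.2f}")     # 1.23
```

The large 10x gain applies to only 20% of the time, so the modest 1.6x gain on 50% of the time comes out slightly ahead.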
Example (2)
• You have a system that contains a special processor for doing floating-point operations. You have determined that 50% of your computations can use the floating-point processor. The speedup of the floating-point processor is 15.
• a) What is the overall speedup achieved by using the floating-point processor?

• b) What is the overall speedup achieved if you modify the compiler so that 75% of the computations can use the floating-point processor?

• c) What fraction of the computations should be able to use the floating-point processor in order to achieve an overall speedup of 2.25?
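
One way to work parts a) to c) is sketched below in Python. The helper `required_fraction` is an illustrative name; it simply solves Amdahl’s Law for the fraction, using 1/S = 1 - f(1 - 1/s).

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup from Amdahl's Law."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def required_fraction(target_speedup: float, speedup_enhanced: float) -> float:
    """Fraction of computations that must use the enhancement to reach
    `target_speedup` (Amdahl's Law solved for the fraction)."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / speedup_enhanced)


part_a = amdahl_speedup(0.50, 15.0)     # 50% of computations use the FP processor
part_b = amdahl_speedup(0.75, 15.0)     # 75% after the compiler change
part_c = required_fraction(2.25, 15.0)  # fraction needed for a speedup of 2.25

print(f"{part_a:.3f}")  # 1.875
print(f"{part_b:.3f}")  # 3.333
print(f"{part_c:.3f}")  # 0.595
```

So roughly 59.5% of the computations would need to use the floating-point processor to reach an overall speedup of 2.25.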
Example (3)

• You have a system that contains a special processor for doing floating-point operations. You have determined that 60% of your computations can use the floating-point processor. When a program uses the floating-point processor, it runs 40% faster than when it doesn’t use it.

– a) What is the overall speedup achieved by using the floating-point processor?

b) In order to improve the speedup, consider two options:
• Option 1: Modify the compiler so that 70% of the computations can use the floating-point processor. The cost of this option is $50K.

• Option 2: Modify the floating-point processor so that a program using it runs 100% faster than when it doesn’t use it. Assume in this case that 50% of the computations can use the floating-point processor. The cost of this option is $60K.

• Which option would you recommend? Justify your answer quantitatively.

Therefore, Option 1 is better because it has a smaller Cost/Speedup ratio.
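
The numbers behind this conclusion can be sketched in Python. The sketch assumes that "40% faster" means an enhanced speedup of 1.4 and "100% faster" means 2.0; the names used here are illustrative.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup from Amdahl's Law."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Part a): 60% of computations, enhanced speedup assumed to be 1.4.
part_a = amdahl_speedup(0.60, 1.4)
print(f"{part_a:.3f}")  # 1.207

# Part b): (fraction, enhanced speedup, cost in dollars) for each option.
OPTIONS = {
    "Option 1": (0.70, 1.4, 50_000),
    "Option 2": (0.50, 2.0, 60_000),
}

cost_per_speedup = {}
for name, (fraction, factor, cost) in OPTIONS.items():
    overall = amdahl_speedup(fraction, factor)
    cost_per_speedup[name] = cost / overall
    print(name, f"speedup={overall:.3f}", f"cost/speedup={cost / overall:,.0f}")

best = min(cost_per_speedup, key=cost_per_speedup.get)
print(best)  # Option 1
```

Option 2 gives the larger speedup (about 1.33 against 1.25), but Option 1 costs about $40K per unit of speedup against about $45K, which is why it is recommended on a cost/speedup basis.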

Example (4)

A program runs in 100 seconds on a machine, with multiply operations responsible for 80 seconds of this time.

• How much do I have to improve the speed of multiplication if I want my program to run five times faster?

Execution time after improvement = (execution time affected by improvement / amount of improvement) + execution time unaffected

Execution time after improvement = (80 seconds / n) + (100 − 80) seconds

We want performance to be 5 times faster, so the new execution time must be 100 / 5 = 20 seconds:

20 seconds = (80 seconds / n) + 20 seconds

0 = 80 / n

No finite n satisfies this equation, so no improvement to multiplication alone can make the program run five times faster.
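
The same reasoning can be written as a short Python sketch. The function name `required_improvement` is illustrative, and the 4x target is an extra case added here only for contrast with the 5x target from the example.

```python
from typing import Optional


def required_improvement(total: float, affected: float,
                         target_speedup: float) -> Optional[float]:
    """Factor n by which the affected part must speed up so that the whole
    program runs `target_speedup` times faster, or None if no finite n works."""
    unaffected = total - affected
    # Time that may still be spent in the affected part after the improvement.
    budget = total / target_speedup - unaffected
    if budget <= 0:
        return None
    return affected / budget


print(required_improvement(100, 80, 4))  # 16.0
print(required_improvement(100, 80, 5))  # None
```

With 20 seconds untouched by the improvement, the program can approach a 5x speedup (100 / 20) but never reach it, while a 4x speedup needs multiplication to be 16 times faster.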
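CPU performance equation
• All computers are constructed using a clock running at a constant rate. These discrete time events are called clock ticks, clock periods, cycles, or clock cycles.
– Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz).
• CPU time can be expressed two ways:

$$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$

$$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$
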
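CPU performance equation (cont’d)

• If we know the number of clock cycles and the instruction count (IC), we can calculate the average number of clock cycles per instruction (CPI):

$$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{IC}}$$
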
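CPU performance equation (cont’d)

• CPU performance is dependent upon three characteristics: clock cycle time (or clock rate), CPI, and IC:

$$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time}$$

• How to improve performance?
– Everything else being equal, we can either
• increase the clock rate (equivalently, reduce the clock cycle time), or
• reduce the number of clock cycles for the program, by lowering the CPI or the IC.
– A 10% improvement in any one of the three characteristics leads to a 10% improvement in CPU time.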
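
The claim that a 10% improvement in any one factor gives a 10% improvement in CPU time can be illustrated with a short Python sketch. The baseline numbers below are hypothetical and chosen only for illustration; only the product form IC x CPI x clock cycle time comes from the lecture.

```python
def cpu_time(instruction_count: float, cpi: float, clock_cycle_time: float) -> float:
    """CPU time = IC x CPI x clock cycle time (seconds)."""
    return instruction_count * cpi * clock_cycle_time


# Hypothetical baseline: 2 million instructions, CPI of 1.5, 1 ns clock period.
IC, CPI, CYCLE_TIME = 2_000_000, 1.5, 1e-9
baseline = cpu_time(IC, CPI, CYCLE_TIME)

# Reduce each factor by 10% in turn and compare against the baseline.
ratios = {
    "IC": cpu_time(IC * 0.9, CPI, CYCLE_TIME) / baseline,
    "CPI": cpu_time(IC, CPI * 0.9, CYCLE_TIME) / baseline,
    "clock cycle time": cpu_time(IC, CPI, CYCLE_TIME * 0.9) / baseline,
}

for factor, ratio in ratios.items():
    print(f"10% lower {factor}: CPU time is {ratio:.2f} of the baseline")
```

Because CPU time is a plain product of the three factors, each one has the same proportional effect. In practice the factors are not independent: a change that lowers IC can raise CPI, for example.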
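CPU performance equation (cont’d)

• To calculate the total number of CPU clock cycles:

$$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i$$

– where IC_i is the number of times instruction i is executed in a program
– CPI_i is the average number of clock cycles per instruction for instruction i.
• This form can be used to express CPU time as

$$\text{CPU time} = \left( \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i \right) \times \text{Clock cycle time}$$
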
Example - 2
Op       Freq   Cycles   CPI (Freq x Cycles)
ALU      50%    1        0.5
Load     20%    5        1.0
Store    10%    3        0.3
Branch   20%    2        0.4
Total                    2.2
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
• Load → 20% x 2 cycles = 0.4
• Total CPI 2.2 → 1.6
• Relative performance is 2.2 / 1.6 = 1.38

How does this compare with reducing the branch instruction to 1 cycle?
• Branch → 20% x 1 cycle = 0.2
• Total CPI 2.2 → 2.0
• Relative performance is 2.2 / 2.0 = 1.1
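
The table and both what-if questions can be reproduced with a weighted sum. This is a minimal Python sketch; the names `MIX`, `average_cpi`, and `with_cycles` are illustrative.

```python
# Instruction mix from the table: operation -> (frequency, cycles).
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def average_cpi(mix: dict) -> float:
    """Average CPI as the frequency-weighted sum of cycles per operation."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix: dict, op: str, cycles: int) -> dict:
    """Copy of the mix in which `op` takes `cycles` cycles."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed


base_cpi = average_cpi(MIX)                             # about 2.2
load_cpi = average_cpi(with_cycles(MIX, "Load", 2))     # about 1.6
branch_cpi = average_cpi(with_cycles(MIX, "Branch", 1)) # about 2.0

print(f"better data cache: {base_cpi / load_cpi:.3f}x")
print(f"faster branches:   {base_cpi / branch_cpi:.3f}x")
```

Loads and branches are equally frequent here, but the load change saves 3 cycles per instruction against 1 for branches, so the data cache improvement gives the larger gain.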



Example – 3
• Suppose we have made the following processor measurements:
– Frequency of FP operations (other than FPSQR) = 25%
– Average CPI of FP operations = 4.0
– Average CPI of other instructions = 1.33
– Frequency of FPSQR = 2%
– CPI of FPSQR = 20
• Two design alternatives are proposed for enhancing the system:
– decrease the CPI of FPSQR to 2, or
– decrease the average CPI of all FP operations to 2.5.
• Compare these two design alternatives using the CPU performance equation.
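CPU performance equation (cont’d)

First, observe that only the CPI changes; the clock rate and instruction count remain identical. We start by finding the original CPI with neither enhancement:

$$\text{CPI}_{\text{original}} = \sum_{i} \text{CPI}_i \times \frac{\text{IC}_i}{\text{IC}} = (4.0 \times 25\%) + (1.33 \times 75\%) \approx 2.0$$

We can compute the CPI for the enhanced FPSQR by subtracting the cycles saved from the original CPI:

$$\text{CPI}_{\text{new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times (20 - 2) = 2.0 - 0.36 = 1.64$$

We can compute the CPI for the enhancement of all FP instructions the same way or by summing the FP and non-FP CPIs. Using the latter gives us

$$\text{CPI}_{\text{new FP}} = (75\% \times 1.33) + (25\% \times 2.5) \approx 1.62$$

Since the CPI of the overall FP enhancement is slightly lower, its performance will be marginally better. Specifically, the speedup for the overall FP enhancement is

$$\text{Speedup}_{\text{new FP}} = \frac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} = \frac{2.0}{1.62} \approx 1.23$$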
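
The comparison can be checked with a short Python sketch that uses only the measurements given in the example. Variable names are illustrative, and the original CPI is computed as the text describes it, from 25% FP operations and 75% other instructions.

```python
# Measurements from the example.
FREQ_FP, CPI_FP = 0.25, 4.0
FREQ_OTHER, CPI_OTHER = 0.75, 1.33
FREQ_FPSQR, CPI_FPSQR = 0.02, 20.0

# Original CPI with neither enhancement (about 2.0).
cpi_original = FREQ_FP * CPI_FP + FREQ_OTHER * CPI_OTHER

# Alternative 1: FPSQR CPI drops from 20 to 2; subtract the cycles saved.
cpi_new_fpsqr = cpi_original - FREQ_FPSQR * (CPI_FPSQR - 2.0)

# Alternative 2: average CPI of all FP operations drops to 2.5.
cpi_new_fp = FREQ_OTHER * CPI_OTHER + FREQ_FP * 2.5

# IC and clock rate are unchanged, so speedup is the ratio of CPIs.
speedup_fpsqr = cpi_original / cpi_new_fpsqr
speedup_fp = cpi_original / cpi_new_fp

print(f"FPSQR enhancement: CPI {cpi_new_fpsqr:.2f}, speedup {speedup_fpsqr:.2f}")
print(f"FP enhancement:    CPI {cpi_new_fp:.2f}, speedup {speedup_fp:.2f}")
```

The two alternatives land within about one percent of each other, with the overall FP enhancement slightly ahead.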
Example - 4
• For the purpose of solving a given application problem, you benchmark a program on two computer systems.
– On system A, the object code executed 80 million Arithmetic Logic Unit operations (ALU ops), 40 million load instructions, and 25 million branch instructions.
– On system B, the object code executed 50 million ALU ops, 50 million loads, and 40 million branch instructions.
– In both systems, each ALU op takes 1 clock cycle, each load takes 3 clock cycles, and each branch takes 5 clock cycles.
• Find the CPI for each system.
• Assuming that the clock on system B is 10% faster than the clock on system A, which system is faster for the given application problem, and by what percentage?

• a) Compute the relative frequency of occurrence of each type of instruction executed in both systems.

• b) Find the CPI for each system.

• c) Assuming that the clock on system B is 10% faster than the clock on system A, which system is faster for the given application problem, and by what percentage?
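
Parts a) to c) can be worked through with the following Python sketch. It assumes that "10% faster" means system B's clock rate is 1.1 times system A's, and it measures time in units of system A's clock period; all names are illustrative.

```python
# Cycles per instruction type (same on both systems) and instruction counts.
CYCLES = {"alu": 1, "load": 3, "branch": 5}
COUNTS_A = {"alu": 80_000_000, "load": 40_000_000, "branch": 25_000_000}
COUNTS_B = {"alu": 50_000_000, "load": 50_000_000, "branch": 40_000_000}


def relative_frequencies(counts: dict) -> dict:
    """a) Share of each instruction type in the executed instructions."""
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def total_cycles(counts: dict) -> int:
    """Total clock cycles: sum of IC_i x CPI_i."""
    return sum(n * CYCLES[op] for op, n in counts.items())


def cpi(counts: dict) -> float:
    """b) Average clock cycles per instruction."""
    return total_cycles(counts) / sum(counts.values())


# c) Execution time in units of system A's clock period.
# Assumption: B's clock rate is 1.1x A's, so B's period is 1/1.1 of A's.
time_a = total_cycles(COUNTS_A) * 1.0
time_b = total_cycles(COUNTS_B) / 1.1

print(relative_frequencies(COUNTS_A))
print(relative_frequencies(COUNTS_B))
print(f"CPI A = {cpi(COUNTS_A):.2f}, CPI B = {cpi(COUNTS_B):.2f}")
print(f"time B / time A = {time_b / time_a:.3f}")
```

Under these assumptions system A is faster by roughly 12%, even though system B executes fewer instructions and has the faster clock, because B's mix is heavier in loads and branches.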
Example - 5
• Suppose we are considering two alternatives for our conditional branch instructions, as follows:
– CPU A: A condition code is set by a compare instruction and followed by a branch that tests the condition code.
– CPU B: A compare is included in the branch.
• On both CPUs, the conditional branch instruction takes 2 cycles, and all other instructions take 1 clock cycle. On CPU A, 20% of all instructions executed are conditional branches; since every branch needs a compare, another 20% of the instructions are compares. Because CPU A does not have the compare included in the branch, assume that its clock cycle time is 1.25 times faster than that of CPU B. Which CPU is faster? What if CPU A’s clock cycle time were only 1.1 times faster?
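• Since we are ignoring all systems issues, we can use the CPU performance formula:

$$\text{CPI}_A = 0.20 \times 2 + 0.80 \times 1 = 1.2$$

since 20% are branches taking 2 clock cycles and the rest of the instructions take 1 cycle each. The performance of CPU A is then

$$\text{CPU time}_A = \text{IC}_A \times 1.2 \times \text{Clock cycle time}_A$$

• Clock cycle time_B is 1.25 × Clock cycle time_A, since A has a clock rate that is 1.25 times higher. Compares are not executed in CPU B, so 20%/80% or 25% of the instructions are now branches taking 2 clock cycles, and the remaining 75% of the instructions take 1 cycle. Hence,

$$\text{CPI}_B = 0.25 \times 2 + 0.75 \times 1 = 1.25$$

• Because CPU B doesn’t execute compares, IC_B = 0.8 × IC_A. Hence, the performance of CPU B is

$$\text{CPU time}_B = (0.8 \times \text{IC}_A) \times 1.25 \times (1.25 \times \text{Clock cycle time}_A) = 1.25 \times \text{IC}_A \times \text{Clock cycle time}_A$$

• Under these assumptions, CPU A, with the shorter clock cycle time, is faster than CPU B, which executes fewer instructions. If CPU A were only 1.1 times faster, then Clock cycle time_B is 1.10 × Clock cycle time_A, and the performance of CPU B is

$$\text{CPU time}_B = (0.8 \times \text{IC}_A) \times 1.25 \times (1.10 \times \text{Clock cycle time}_A) = 1.10 \times \text{IC}_A \times \text{Clock cycle time}_A$$

With this change, CPU B, which executes fewer instructions, is now faster.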
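
The two cases can also be checked numerically. The Python sketch below expresses CPU time relative to IC_A × Clock cycle time_A; the break-even ratio at the end is a small extension that is not in the original example, and all names are illustrative.

```python
def relative_cpu_time(ic_ratio: float, cpi: float, cycle_time_ratio: float) -> float:
    """CPU time in units of IC_A x clock cycle time_A."""
    return ic_ratio * cpi * cycle_time_ratio


# CPU A: 20% branches at 2 cycles, everything else at 1 cycle.
CPI_A = 0.20 * 2 + 0.80 * 1
time_a = relative_cpu_time(1.0, CPI_A, 1.0)

# CPU B: no compares, so 80% of A's instructions, 25% of them branches.
CPI_B = 0.25 * 2 + 0.75 * 1


def time_b(clock_cycle_ratio: float) -> float:
    """CPU B time when its clock cycle is `clock_cycle_ratio` times A's."""
    return relative_cpu_time(0.8, CPI_B, clock_cycle_ratio)


print(f"A: {time_a:.2f}, B at 1.25x cycle time: {time_b(1.25):.2f}")
print(f"A: {time_a:.2f}, B at 1.10x cycle time: {time_b(1.10):.2f}")

# Clock cycle ratio at which both CPUs take the same time.
break_even = time_a / (0.8 * CPI_B)
print(f"break-even ratio: {break_even:.2f}")
```

CPU A wins only while its clock cycle advantage stays above the break-even ratio of about 1.2, which is consistent with A being faster at 1.25 and B being faster at 1.1.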
Example - 6
• Suppose you have a load/store computer with the following instruction mix:

• a) Compute the CPI.

• b) We observe that 35% of the ALU ops are paired with a load, and we propose to replace these ALU ops and their loads with a new instruction. The new instruction takes 1 clock cycle. With the new instruction added, branches take 5 clock cycles. Compute the CPI for the new version.

• c) If the clock of the old version is 20% faster than that of the new version, which version has the faster CPU execution time, and by what percentage?