L7 Performance
Performance
• Speed up of execution
– Response time: how long it takes to do a task
– Throughput: total work done per unit time
• Instruction set, Hardware
• Software – OS, Compilers
Understanding performance
• Algorithm
– Determines number of operations executed
• Programming language, compiler,
architecture
– Determine number of machine instructions
executed per operation
• Processor and memory system
– Determine how fast instructions are executed
• I/O system (including OS)
– Determines how fast I/O operations are executed
Page 23 COA August 2024
Understanding performance
• How programs are translated into the machine
language
– And how the hardware executes them
• The hardware/software interface
• What determines performance of a program
• How hardware designers improve performance
– parallel processing
• How to improve energy efficiency
CPU performance
• Performance = 1/Execution Time
• CPU time = CPU clocks × clock cycle time (tc)
• CPU clocks = Instruction count (IC) × clocks per instruction (CPI)
• CPU time = IC × CPI × tc = IC × CPI / clock frequency (fc)
– CPI – average clocks per instruction
• Determined by CPU hardware
• If different instructions have different CPI
– Average CPI affected by instruction mix
• Used to compare two different implementations of the same ISA
– IC – Instruction count
• Determined by program, ISA and compiler
– ISA – Instruction set architecture
Example
Computer   tc       CPI   CPU time
A          250 ps   2     IC × 2 × 250 ps = 500 × IC ps
B          500 ps   1.2   IC × 1.2 × 500 ps = 600 × IC ps
Relative performance: Perf A / Perf B = CPU time B / CPU time A = 600 / 500 = 1.2
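The comparison above can be checked numerically. This is a minimal Python sketch, assuming both computers run the same program on the same ISA so that the instruction count is identical; the function name `cpu_time` and the value chosen for `IC` are illustrative and not from the slides.

```python
def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time = IC x CPI x clock cycle time (seconds)."""
    return instruction_count * cpi * cycle_time_s


# Assumed instruction count; it cancels out in the ratio below.
IC = 1_000_000

time_a = cpu_time(IC, cpi=2.0, cycle_time_s=250e-12)  # computer A
time_b = cpu_time(IC, cpi=1.2, cycle_time_s=500e-12)  # computer B

# Performance = 1 / execution time, so Perf A / Perf B = time B / time A.
relative_perf = time_b / time_a
print(f"Perf A / Perf B = {relative_perf:.2f}")  # prints 1.20
```

Changing `IC` leaves the ratio unchanged, which is why the slide can carry IC through symbolically.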
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$
Power
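• Dynamic power: Power ∝ C × V² × fc
– C: capacitive load, V: supply voltage, fc: clock frequency
• Leakage
[Figure annotations: power ×30, supply voltage 5 V → <1 V, clock frequency ×1000. Courtesy: H&P, Computer Organisation, 6e]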
Reducing power
• Suppose a new CPU has
– 85% of capacitive load of old CPU
– 15% voltage and 15% frequency reduction
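The effect of these reductions follows from the dynamic power relation Power ∝ C × V² × fc. Here is a small Python sketch that works out the ratio of new to old power, assuming only dynamic power matters (leakage is ignored); the function name `dynamic_power` and the normalised old values of 1.0 are illustrative.

```python
def dynamic_power(cap_load, voltage, frequency):
    """Dynamic power up to a constant factor: C x V^2 x f."""
    return cap_load * voltage ** 2 * frequency


# Old CPU normalised to 1.0 in every parameter (assumption for illustration).
p_old = dynamic_power(1.0, 1.0, 1.0)

# New CPU: 85% capacitive load, 15% lower voltage, 15% lower frequency.
p_new = dynamic_power(0.85, 0.85, 0.85)

ratio = p_new / p_old  # 0.85 ** 4
print(f"P_new / P_old = {ratio:.3f}")  # prints 0.522
```

Under these assumptions the new CPU draws roughly 52% of the old CPU's dynamic power, because voltage enters the relation squared.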
Processor performance
Multicore processors
• Requires explicitly parallel programming
• Compare with instruction-level parallelism
– Hardware executes multiple instructions at once
– Hidden from the programmer
• Programming for performance
• Scheduling
• Load balancing
• Optimizing communication and synchronization
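Geometric mean of execution time ratios over n benchmarks:
$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
Overall ssj_ops per Watt:
$$\text{Overall ssj\_ops per Watt} = \left( \sum_{i=0}^{10} \text{ssj\_ops}_i \right) \Big/ \left( \sum_{i=0}^{10} \text{power}_i \right)$$
Courtesy: H&P, Computer Organisation, 6e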
Amdahl’s Law
• Pitfall: improving one aspect of a computer and expecting a
proportional improvement in overall performance
• Corollary: make the common case fast
$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$
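A short Python sketch of the relation above. The numbers are hypothetical and chosen only for illustration (a 100 s program in which 80 s is affected by the improvement); `improved_time` is an illustrative name, not from the slides.

```python
def improved_time(t_affected, t_unaffected, improvement_factor):
    """Amdahl's Law: only the affected part of the time shrinks."""
    return t_affected / improvement_factor + t_unaffected


# Hypothetical program: 80 s affected by the improvement, 20 s unaffected.
t_old = 80.0 + 20.0
t_new = improved_time(80.0, 20.0, improvement_factor=4.0)  # 80/4 + 20 = 40 s
speedup = t_old / t_new

print(f"Improved time: {t_new:.1f} s, overall speedup: {speedup:.2f}")

# However large the improvement factor, the unaffected 20 s remains,
# so the overall speedup stays below 100 / 20 = 5.
limit = t_old / improved_time(80.0, 20.0, improvement_factor=1e12)
```

A 4× improvement to 80% of the run time yields only a 2.5× overall speedup in this hypothetical case, which is the pitfall the slide warns about.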
Example
• Consider three different processors P1, P2, and P3 executing the
same instruction set. P1 has a 3 GHz clock rate and a CPI of
1.5. P2 has a 2.5 GHz clock rate and a CPI of 1.0. P3 has a 4.0
GHz clock rate and has a CPI of 2.2.
– Which processor has the highest performance expressed in instructions per
second?
Processor   Instructions/sec
P1          3 × 10^9 / 1.5 = 2.0 × 10^9
P2          2.5 × 10^9 / 1.0 = 2.5 × 10^9
P3          4 × 10^9 / 2.2 ≈ 1.82 × 10^9
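The table can be reproduced with a few lines of Python, using instructions per second = clock rate / CPI. The dictionary layout and variable names are illustrative choices.

```python
# (clock rate in Hz, CPI) for each processor, taken from the example.
processors = {
    "P1": (3.0e9, 1.5),
    "P2": (2.5e9, 1.0),
    "P3": (4.0e9, 2.2),
}

# Instructions per second = clock rate / CPI.
rates = {name: clock / cpi for name, (clock, cpi) in processors.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2e} instructions/sec")

fastest = max(rates, key=rates.get)
print(f"Highest instruction rate: {fastest}")  # prints P2
```

P2 comes out ahead despite having the lowest clock rate, because its CPI is the lowest.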
Example
• Consider two different implementations of the same ISA. The
instructions can be divided into four classes according to their CPI
(class A, B, C, and D). P1 has a clock rate of 2.5 GHz and CPIs of 1, 2,
3, and 3; P2 has a clock rate of 3 GHz and CPIs of 2, 2, 2, and 2.
Given a program with a dynamic instruction count of 1.0E6 instructions
divided into classes as follows: 10% class A, 20% class B, 50% class C,
and 20% class D, which implementation is faster? What is the global
CPI for each implementation?
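One way to work this example is to compute each global CPI as the mix-weighted average of the class CPIs, then apply CPU time = IC × CPI / clock rate. The following Python sketch does that; the helper names `global_cpi` and `exec_time` are illustrative.

```python
IC = 1.0e6  # dynamic instruction count from the example

# Fraction of instructions in each class.
mix = {"A": 0.10, "B": 0.20, "C": 0.50, "D": 0.20}

# Per-class CPIs and clock rates (Hz) for the two implementations.
p1 = {"clock": 2.5e9, "cpi": {"A": 1, "B": 2, "C": 3, "D": 3}}
p2 = {"clock": 3.0e9, "cpi": {"A": 2, "B": 2, "C": 2, "D": 2}}


def global_cpi(class_cpi, mix):
    """Weighted average CPI over the instruction mix."""
    return sum(class_cpi[c] * mix[c] for c in mix)


def exec_time(ic, cpi, clock_hz):
    """CPU time = IC x CPI / clock rate (seconds)."""
    return ic * cpi / clock_hz


cpi1 = global_cpi(p1["cpi"], mix)      # 2.6
cpi2 = global_cpi(p2["cpi"], mix)      # 2.0
t1 = exec_time(IC, cpi1, p1["clock"])  # about 1.04 ms
t2 = exec_time(IC, cpi2, p2["clock"])  # about 0.667 ms

print(f"P1: CPI = {cpi1:.2f}, time = {t1 * 1e3:.3f} ms")
print(f"P2: CPI = {cpi2:.2f}, time = {t2 * 1e3:.3f} ms")
```

With this instruction mix, P2 is faster: it has both the lower global CPI and the higher clock rate.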
Exercise
• A processor has CPIs of 1, 12, and 5, respectively, for arithmetic,
load/store, and branch instructions. Assume that:
– On a single processor a program requires the execution of 2.56E9
arithmetic instructions, 1.28E9 load/store instructions, and 256
million branch instructions.
– Each processor has a 2 GHz clock frequency.
– As the program is parallelized to run over multiple cores, the
number of arithmetic and load/store instructions per processor is
divided by 0.7 × p (where p is the number of processors) but the
number of branch instructions per processor remains the same.
• Find the total execution time for this program on 1, 2, 4, and 8
processors, and show the relative speedup of the 2, 4, and 8
processor result relative to the single processor result.
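A possible way to set up this exercise in Python is sketched below. It assumes that the division by 0.7 × p applies only when the program is actually parallelized (p > 1), so the single-processor run uses the instruction counts as given; this reading, and the names `exec_time`, `COUNT`, and `CPI`, are assumptions rather than part of the exercise text.

```python
CLOCK_HZ = 2.0e9
CPI = {"arith": 1, "ldst": 12, "branch": 5}
COUNT = {"arith": 2.56e9, "ldst": 1.28e9, "branch": 2.56e8}


def exec_time(p):
    """Per-processor execution time in seconds on p processors."""
    # Assumption: the 0.7 x p scaling only applies once parallelized.
    scale = 1.0 if p == 1 else 0.7 * p
    cycles = (
        COUNT["arith"] / scale * CPI["arith"]
        + COUNT["ldst"] / scale * CPI["ldst"]
        + COUNT["branch"] * CPI["branch"]  # branch count is not divided
    )
    return cycles / CLOCK_HZ


t1 = exec_time(1)
for p in (1, 2, 4, 8):
    t = exec_time(p)
    print(f"p = {p}: time = {t:.2f} s, speedup = {t1 / t:.2f}")
```

Under this reading the times come out as 9.60 s, 7.04 s, 3.84 s, and 2.24 s. The speedup grows more slowly than p because the branch instructions are not divided and because of the 0.7 efficiency factor.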
Exercise
• A computer spends 30 percent of its time accessing memory, 20
percent performing multiplications, and 50 percent executing
other instructions. As a computer architect, you have to choose
between improving either the memory, multiplication hardware,
or execution of non-multiplication instructions. There is only
space on the chip for one improvement, and each of the
improvements will improve its associated part of the
computation by a factor of 2.
– Without performing any calculations, which improvement would
you expect to give the largest performance increase, and why?
– What speedup would making each of the three changes give?
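The second question can be checked with Amdahl's Law written in terms of the fraction of time affected: speedup = 1 / ((1 − f) + f / k). The sketch below is one way to do this in Python; `amdahl_speedup` and the dictionary keys are illustrative names.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Fraction of execution time spent in each part, from the exercise.
options = {
    "memory": 0.30,
    "multiplication": 0.20,
    "other instructions": 0.50,
}

# Each candidate improvement speeds up its own part by a factor of 2.
speedups = {name: amdahl_speedup(f, 2.0) for name, f in options.items()}

for name, s in speedups.items():
    print(f"{name}: speedup = {s:.3f}")

best = max(speedups, key=speedups.get)
```

The largest gain comes from the part that takes the most time (the 50% spent on other instructions), which matches the intuition asked for in the first question.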
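MIPS: millions of instructions per second
$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$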
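The two forms of the MIPS expression can be compared numerically. In this Python sketch the clock rate and CPI are borrowed from processor P1 in the earlier example, and the instruction count is an assumed value; the function names are illustrative.

```python
def mips_from_time(instruction_count, exec_time_s):
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (exec_time_s * 1e6)


def mips_from_clock(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


clock_rate = 3.0e9  # Hz, as for P1 in the earlier example
cpi = 1.5
ic = 1.0e6          # assumed instruction count; cancels out

exec_time = ic * cpi / clock_rate  # CPU time = IC x CPI / clock rate

a = mips_from_time(ic, exec_time)
b = mips_from_clock(clock_rate, cpi)
print(f"MIPS from time: {a:.0f}, MIPS from clock rate: {b:.0f}")  # both 2000
```

Both forms agree because the instruction count cancels, which also shows that MIPS by itself says nothing about how many instructions a program needs.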
Summary
• Cost/performance is improving
– Due to underlying technology development
• Hierarchical layers of abstraction
– In both hardware and software
• Instruction set architecture
– The hardware/software interface
• Execution time: the best performance measure
• Power is a limiting factor
– Use parallelism to improve performance