L7 Performance


EC340 COA

Performance
• Speed up of execution
– Response time: how long it takes to do a task
– Throughput: total work done per unit time
• Instruction set, Hardware
• Software – OS, Compilers

• CPU time = user CPU time + system CPU time

Page 22 COA August 2024

Understanding performance
• Algorithm
– Determines number of operations executed
• Programming language, compiler,
architecture
– Determine number of machine instructions
executed per operation
• Processor and memory system
– Determine how fast instructions are executed
• I/O system (including OS)
– Determines how fast I/O operations are executed

Dept of E&C, NITK Surathkal 1



Understanding performance
• How programs are translated into the machine
language
– And how the hardware executes them
• The hardware/software interface
• What determines performance of a program
• How hardware designers improve performance
– parallel processing
• How to improve energy efficiency


Seven great ideas


• Use abstraction to simplify design
• Make the common case fast
• Performance via parallelism
• Performance via pipelining
• Performance via prediction
• Hierarchy of memories
• Dependability via redundancy


CPU performance
• Performance = 1/Execution Time
• CPU time = CPU clocks x clock cycle time (tc)
• CPU clocks = Instruction count x clocks/instruction
• CPU time = IC x CPI/clock frequency (fc)
– CPI – average clocks per instruction
• Determined by CPU hardware
• If different instructions have different CPI
– Average CPI affected by instruction mix
• Compare two different implementations of the same ISA
– IC – Instruction count
• Determined by program, ISA and compiler
– ISA – Instruction set architecture

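Example
Computer   tc       CPI   CPU time
A          250 ps   2     IC × 2 × 250 ps
B          500 ps   1.2   IC × 1.2 × 500 ps

\[ \frac{\text{Perf}_A}{\text{Perf}_B} = \frac{\text{CPU time}_B}{\text{CPU time}_A} = \frac{\text{IC} \times 600\,\text{ps}}{\text{IC} \times 500\,\text{ps}} = 1.2 \]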
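The comparison of computers A and B can be checked numerically. The following is a minimal sketch, assuming an arbitrary instruction count (it cancels in the ratio); the function name `cpu_time` and the constant `IC` are illustrative and not part of the slides.

```python
def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical instruction count; any positive value gives the same ratio.
IC = 1_000_000

time_a = cpu_time(IC, cpi=2.0, cycle_time_s=250e-12)  # computer A
time_b = cpu_time(IC, cpi=1.2, cycle_time_s=500e-12)  # computer B

# Performance is the reciprocal of execution time,
# so Perf A / Perf B = CPU time B / CPU time A.
relative_perf = time_b / time_a
print(f"A is {relative_perf:.2f} times as fast as B")
```

Because both machines run the same program on the same ISA, only CPI and cycle time differ between the two calls.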

If different instruction classes take different numbers of cycles


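\[ \text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right) \]

Weighted average CPI

\[ \text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right) \]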


Power

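\[ \text{Power} \propto C \times V^2 \times f_c \]

(Figure annotations: power ×30, supply voltage 5 V → <1 V, clock frequency ×1000)
• Dynamic Power
• Leakage
Courtesy- H&P, Computer Organisation, 6e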


Reducing power
• Suppose a new CPU has
– 85% of capacitive load of old CPU
– 15% voltage and 15% frequency reduction

\[ \frac{P_{new}}{P_{old}} = \frac{C_{old} \times 0.85 \times (V_{old} \times 0.85)^2 \times F_{old} \times 0.85}{C_{old} \times V_{old}^2 \times F_{old}} = 0.85^4 = 0.52 \]
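The power ratio above follows directly from the proportionality of dynamic power to C × V² × fc. A minimal sketch, assuming only that relation; the function name `dynamic_power_ratio` is illustrative.

```python
def dynamic_power_ratio(c_scale, v_scale, f_scale):
    """Ratio P_new / P_old when capacitive load, voltage and frequency
    are each scaled by the given factors (dynamic power ~ C * V^2 * f)."""
    return c_scale * v_scale ** 2 * f_scale


# New CPU: 85% of the capacitive load, 15% lower voltage and frequency.
ratio = dynamic_power_ratio(c_scale=0.85, v_scale=0.85, f_scale=0.85)
print(f"P_new / P_old = {ratio:.2f}")
```

Voltage enters squared, which is why voltage scaling was historically the most effective lever for reducing power.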

• The power wall
– We can’t reduce voltage further
– We can’t remove more heat
• How else can we improve performance?


Processor performance

Constrained by power, instruction-level parallelism, memory latency

Courtesy- H&P, Computer Organisation, 6e

Multicore processors
• Requires explicitly parallel programming
– Compare with instruction-level parallelism, where hardware executes multiple instructions at once, hidden from the programmer
• Programming for performance
– Scheduling
– Load balancing
– Optimizing communication and synchronization


SPEC CPU Benchmark


• Programs used to measure performance
– Supposedly typical of actual workload
• Standard Performance Evaluation Corporation (SPEC)
– Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2017
– Elapsed time to execute a selection of programs
– Negligible I/O, so focuses on CPU performance
– Normalize relative to reference machine
– Integer (10) and floating-point (13)
– Summarize as geometric mean of performance ratios

\[ \sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i} \]
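The geometric-mean summary can be sketched as follows. The ratios used here are hypothetical placeholders, not SPEC results, and `geometric_mean` is an illustrative helper name.

```python
import math


def geometric_mean(ratios):
    """n-th root of the product of n positive ratios.

    Computed through logarithms so that a long list of ratios
    does not overflow or underflow the intermediate product.
    """
    if not ratios:
        raise ValueError("at least one ratio is required")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical execution time ratios (reference time / measured time).
example_ratios = [2.0, 8.0]
print(geometric_mean(example_ratios))
```

One reason to prefer the geometric mean for normalized ratios is that the ranking of two machines does not depend on which machine is chosen as the reference.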


SPECspeed 2017 Integer benchmarks on a 1.8 GHz Intel Xeon E5-2650L

Courtesy- H&P, Computer Organisation, 6e


SPEC power benchmark


• Power consumption of server at different
workload levels
– Performance: ssj_ops/sec
– Power: Watts (Joules/sec)

\[ \text{Overall ssj\_ops per Watt} = \left( \sum_{i=0}^{10} \text{ssj\_ops}_i \right) \Big/ \left( \sum_{i=0}^{10} \text{power}_i \right) \]
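The overall metric is a ratio of sums over the eleven load levels (i = 0 to 10), not an average of per-level ratios. A minimal sketch with made-up measurements; the numbers and the name `overall_ssj_ops_per_watt` are hypothetical and do not come from any published result.

```python
def overall_ssj_ops_per_watt(ssj_ops, power_watts):
    """Sum of ssj_ops over all load levels divided by the sum of power."""
    if len(ssj_ops) != len(power_watts):
        raise ValueError("need one power reading per load level")
    return sum(ssj_ops) / sum(power_watts)


# Hypothetical readings at 100%, 90%, ..., 10%, 0% target load.
ssj_ops = [1000 * level for level in range(10, -1, -1)]
power_watts = [50 + 5 * level for level in range(10, -1, -1)]

overall = overall_ssj_ops_per_watt(ssj_ops, power_watts)
print(f"{overall:.1f} ssj_ops per Watt")
```

The 0% load level contributes no ssj_ops but still contributes idle power, so high idle power lowers the overall score.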


SPECpower_ssj2008 for Xeon E5-2650L

Courtesy- H&P, Computer Organisation, 6e

Amdahl’s Law
• Pitfall: improving an aspect of a computer and expecting a proportional improvement in overall performance
• Corollary: make the common case fast

\[ T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}} \]

• Example: multiply accounts for 80 s out of 100 s
• How much improvement in multiply performance to get 5× overall?

\[ 20 = \frac{80}{n} + 20 \]

• Can’t be done!
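The multiply example can be checked by solving the Amdahl's Law equation for the improvement factor n. This is a minimal sketch; the helper names `improved_time` and `factor_needed` are illustrative.

```python
def improved_time(t_affected, t_unaffected, factor):
    """Amdahl's Law: T_improved = T_affected / factor + T_unaffected."""
    return t_affected / factor + t_unaffected


def factor_needed(t_affected, t_unaffected, target_time):
    """Improvement factor required to reach target_time.

    Returns None when the target is unreachable, i.e. when the
    unaffected part alone already takes target_time or longer.
    """
    remaining = target_time - t_unaffected
    if remaining <= 0:
        return None
    return t_affected / remaining


# Multiply takes 80 s of a 100 s run; a 5x overall speedup means 20 s total.
print(factor_needed(80, 20, target_time=100 / 5))  # unreachable target
# A 4x overall speedup (25 s total) is reachable.
print(factor_needed(80, 20, target_time=100 / 4))
```

The 20 s of unaffected work sets a floor on the execution time, so the overall speedup can approach 5× but never reach it.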

Example
• Consider three different processors P1, P2, and P3 executing the
same instruction set. P1 has a 3 GHz clock rate and a CPI of
1.5. P2 has a 2.5 GHz clock rate and a CPI of 1.0. P3 has a 4.0
GHz clock rate and has a CPI of 2.2.
– Which processor has the highest performance expressed in instructions per
second?

Processor   Instns/sec
P1          3 × 10^9 / 1.5 = 2 × 10^9
P2          2.5 × 10^9 / 1 = 2.5 × 10^9
P3          4 × 10^9 / 2.2 = 1.8 × 10^9
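The instruction rates in this example are clock rate divided by CPI. A short sketch that reproduces them and picks the highest; the dictionary layout and names are illustrative.

```python
def instructions_per_second(clock_rate_hz, cpi):
    """Instructions per second = clock rate / CPI."""
    return clock_rate_hz / cpi


# (clock rate in Hz, CPI) for each processor in the example.
processors = {
    "P1": (3.0e9, 1.5),
    "P2": (2.5e9, 1.0),
    "P3": (4.0e9, 2.2),
}

rates = {
    name: instructions_per_second(clock, cpi)
    for name, (clock, cpi) in processors.items()
}
fastest = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {rate:.2e} instructions/s")
print("Highest:", fastest)
```

P3 has the highest clock rate but the lowest instruction rate, which shows that clock frequency alone does not determine performance.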


Example
• Consider two different implementations of the same ISA. The
instructions can be divided into four classes according to their CPI
(class A, B, C, and D). P1 with a clock rate of 2.5 GHz and CPIs of 1, 2,
3, and 3, and P2 with a clock rate of 3 GHz and CPIs of 2, 2, 2, and 2.
Given a program with a dynamic instruction count of 1.0E6 instructions
divided into classes as follows: 10% class A, 20% class B, 50% class C,
and 20% class D, which implementation is faster? What is the global
CPI for each implementation?

– Time = No. instr. × CPI / clock rate

Processor   Total Time        CPI
P1          10.4 × 10^-4 s    2.6
P2          6.67 × 10^-4 s    2
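The global CPI is the mix-weighted average of the class CPIs, and the execution time follows from it. A minimal sketch of that calculation for P1 and P2; the function and variable names are illustrative.

```python
def weighted_cpi(cpi_by_class, mix):
    """Global CPI: sum over classes of CPI_i x fraction of instructions."""
    return sum(cpi_by_class[cls] * fraction for cls, fraction in mix.items())


def execution_time(instruction_count, cpi, clock_rate_hz):
    """Time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


INSTRUCTION_COUNT = 1.0e6
mix = {"A": 0.10, "B": 0.20, "C": 0.50, "D": 0.20}

p1_cpi = weighted_cpi({"A": 1, "B": 2, "C": 3, "D": 3}, mix)
p2_cpi = weighted_cpi({"A": 2, "B": 2, "C": 2, "D": 2}, mix)

p1_time = execution_time(INSTRUCTION_COUNT, p1_cpi, 2.5e9)
p2_time = execution_time(INSTRUCTION_COUNT, p2_cpi, 3.0e9)

print(f"P1: CPI {p1_cpi:.2f}, time {p1_time:.3e} s")
print(f"P2: CPI {p2_cpi:.2f}, time {p2_time:.3e} s")
```

P2 is faster for this mix because half of the instructions fall in class C, where P1 needs 3 cycles against 2 for P2, and P2 also has the higher clock rate.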


Exercise
• A processor has CPIs of 1, 12, and 5, respectively, for arithmetic,
load/store, and branch instructions. Assume that:
– On a single processor a program requires the execution of 2.56E9
arithmetic instructions, 1.28E9 load/store instructions, and 256
million branch instructions.
– Each processor has a 2 GHz clock frequency.
– As the program is parallelized to run over multiple cores, the
number of arithmetic and load/store instructions per processor is
divided by 0.7 x p (where p is the number of processors) but the
number of branch instructions per processor remains the same.
• Find the total execution time for this program on 1, 2, 4, and 8
processors, and show the relative speedup of the 2, 4, and 8
processor result relative to the single processor result.
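One way to set up this exercise is sketched below. It assumes that the single-processor case uses the instruction counts as given (no division by 0.7 × p when p = 1), which is an interpretation of the problem statement rather than something the slide states; all names are illustrative.

```python
CLOCK_HZ = 2.0e9
CPI = {"arithmetic": 1, "load_store": 12, "branch": 5}
COUNT = {"arithmetic": 2.56e9, "load_store": 1.28e9, "branch": 256e6}


def execution_time(p):
    """Execution time in seconds on p processors.

    Assumption: for p == 1 the counts are used unchanged; for p > 1 the
    arithmetic and load/store counts per processor are divided by 0.7 * p.
    The branch count per processor is never reduced.
    """
    scale = 1.0 if p == 1 else 0.7 * p
    parallel_cycles = (
        COUNT["arithmetic"] * CPI["arithmetic"]
        + COUNT["load_store"] * CPI["load_store"]
    ) / scale
    branch_cycles = COUNT["branch"] * CPI["branch"]
    return (parallel_cycles + branch_cycles) / CLOCK_HZ


baseline = execution_time(1)
for p in (1, 2, 4, 8):
    t = execution_time(p)
    print(f"p={p}: {t:.2f} s, speedup {baseline / t:.2f}")
```

The branch instructions act as the unaffected part in Amdahl's Law, so the speedup grows more slowly than the processor count.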


Exercise
• A computer spends 30 percent of its time accessing memory, 20
percent performing multiplications, and 50 percent executing
other instructions. As a computer architect, you have to choose
between improving either the memory, multiplication hardware,
or execution of non-multiplication instructions. There is only
space on the chip for one improvement, and each of the
improvements will improve its associated part of the
computation by a factor of 2.
– Without performing any calculations, which improvement would
you expect to give the largest performance increase, and why?
– What speedup would making each of the three changes give?
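The three candidate improvements can be compared with the fraction form of Amdahl's Law, speedup = 1 / ((1 − f) + f / s), where f is the fraction of time affected and s the improvement factor. A minimal sketch; the names are illustrative.

```python
def overall_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Fraction of execution time spent in each part, from the exercise.
fractions = {"memory": 0.30, "multiplication": 0.20, "other": 0.50}

speedups = {
    part: overall_speedup(fraction, factor=2)
    for part, fraction in fractions.items()
}
best = max(speedups, key=speedups.get)

for part, s in speedups.items():
    print(f"{part}: {s:.3f}x")
print("Largest gain:", best)
```

The part that accounts for the largest share of the time gives the largest gain for the same improvement factor, which is the "make the common case fast" idea.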


MIPS as performance benchmark


• MIPS: Millions of Instructions Per Second
– Doesn’t account for
• Differences in ISAs between computers
• Differences in complexity between instructions

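\[ \text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6} \]

• CPI varies between programs on a given CPU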
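Because MIPS depends on CPI, the same CPU reports different MIPS ratings for different programs. The sketch below uses a hypothetical 2 GHz CPU and two hypothetical programs; none of these numbers come from the slides.

```python
def mips_from_rate(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def mips_from_counts(instruction_count, execution_time_s):
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


CLOCK_HZ = 2.0e9  # hypothetical CPU

# Two hypothetical programs with different average CPI on the same CPU.
mips_program_1 = mips_from_rate(CLOCK_HZ, cpi=1.0)
mips_program_2 = mips_from_rate(CLOCK_HZ, cpi=2.5)

# Both definitions agree: 1e6 instructions at CPI 2.5 take 1.25 ms.
instruction_count = 1.0e6
time_s = instruction_count * 2.5 / CLOCK_HZ
check = mips_from_counts(instruction_count, time_s)

print(mips_program_1, mips_program_2, check)
```

A single MIPS figure therefore says little unless the program and the ISA are also specified.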


Summary
• Cost/performance is improving
– Due to underlying technology development
• Hierarchical layers of abstraction
– In both hardware and software
• Instruction set architecture
– The hardware/software interface
• Execution time: the best performance measure
• Power is a limiting factor
– Use parallelism to improve performance
