Cs2100 14 Understanding Performance

This document discusses performance metrics and factors that affect computer performance. It defines key performance metrics like response time and throughput. Performance is affected by the number of instructions, average clock cycles per instruction (CPI), and clock rate. CPI depends on the instruction mix and implementation. The document outlines how benchmarks are used to compare performance between systems using a standardized workload. Common benchmarks like SPEC CPU are discussed.

Uploaded by

amanda
Copyright: © All Rights Reserved

CS2100 Computer Organisation

Understanding Performance

Performance Metrics

Purchasing perspective: given a collection of machines, which has the
best performance?
least cost?
best cost/performance?

Design perspective: faced with design options, which has the
best performance improvement?
least cost?
best cost/performance?

2011 Sem 1

Performance Metrics

Both require
basis for comparison
metric for evaluation

Goal: Understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

On cost and pricing

Technology cost
R&D

Manufacturing cost
Raw material
Cost of factory

Marketing cost

Distribution (including advertising)


Price of competitor
Volume
Consumer expectation

Moore's Law
The number of transistors that can be inexpensively placed on an integrated circuit is increasing exponentially, doubling approximately every two years.
- Gordon Moore, co-founder of Intel (1965; the two-year doubling period dates from his 1975 revision)

Moore's Law (figure)

Main driver: device scaling ...

From: Facing the Hot Chips Challenge Again, Bill Holt, Intel, presented at Hot Chips 17, 2005.
Secondary driver: Wafer size

From: Facing the Hot Chips Challenge Again, Bill Holt, Intel, presented at Hot Chips 17, 2005.

Intel Penryn
45 nm technology
Quad-core will have 820 million transistors
Dual-core will be 107 mm²

Defining (Speed) Performance

Normally interested in reducing
Response time (aka execution time): the time between the start and the completion of a task
Important to individual users
Thus, to maximize performance, need to minimize execution time

$$\text{performance}_X = \frac{1}{\text{execution\_time}_X}$$

Defining (Speed) Performance

If X is n times faster than Y, then

$$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n$$

Throughput: the total amount of work done in a given time
Important to data center managers
Decreasing response time almost always improves throughput
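
As a rough illustration of the "n times faster" relation, the sketch below computes n from two execution times. The function names and the 10 s / 15 s measurements are hypothetical and chosen only for this example.

```python
def performance(execution_time_s: float) -> float:
    """Performance is defined as the reciprocal of execution time."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return 1.0 / execution_time_s


def times_faster(execution_time_x: float, execution_time_y: float) -> float:
    """Return n such that machine X is n times faster than machine Y."""
    return performance(execution_time_x) / performance(execution_time_y)


# Hypothetical measurements: X finishes the task in 10 s, Y in 15 s.
n = times_faster(10.0, 15.0)
print(f"X is {n:.2f} times faster than Y")
```

Because performance is the reciprocal of execution time, the ratio of performances is the inverse ratio of execution times, which is why Y's time ends up in the numerator.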

Performance Factors

Want to distinguish elapsed time and the time spent on our task
CPU execution time (CPU time): time the CPU spends working on a task
Does not include time waiting for I/O or running other programs

Performance Factors
$$\text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time}$$

or

$$\text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}}$$

Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
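
A minimal sketch of the two equivalent forms of the CPU execution time formula. The cycle count and clock rate below are made-up numbers, not measurements from the slides.

```python
def cpu_time_from_cycle_time(clock_cycles: float, clock_cycle_time_s: float) -> float:
    """CPU time = number of clock cycles x clock cycle time."""
    return clock_cycles * clock_cycle_time_s


def cpu_time_from_clock_rate(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = number of clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


# Hypothetical program: 20 billion cycles on a 2 GHz machine (0.5 ns per cycle).
cycles = 20e9
rate_hz = 2e9
cycle_time_s = 1.0 / rate_hz

print(cpu_time_from_cycle_time(cycles, cycle_time_s))  # seconds
print(cpu_time_from_clock_rate(cycles, rate_hz))       # same value, other form
```

Both calls return the same CPU time, since the clock cycle time is the reciprocal of the clock rate.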

Review: Machine Clock Rate

Clock rate (MHz, GHz) is inverse of clock cycle time (clock period)
CC = 1 / CR

(figure: one clock period)

10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec clock cycle => 1 GHz clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
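
The conversions listed on this slide follow directly from CC = 1 / CR. The short sketch below recomputes them; the helper name `clock_rate_hz` is an assumption introduced for this example.

```python
def clock_rate_hz(clock_cycle_s: float) -> float:
    """Clock rate is the inverse of the clock cycle time (CR = 1 / CC)."""
    if clock_cycle_s <= 0:
        raise ValueError("clock cycle time must be positive")
    return 1.0 / clock_cycle_s


# Clock cycle times from the slide, expressed in seconds.
cycle_times_s = [10e-9, 5e-9, 2e-9, 1e-9, 500e-12, 250e-12, 200e-12]

for cc in cycle_times_s:
    rate_mhz = clock_rate_hz(cc) / 1e6
    print(f"{cc * 1e12:8.0f} psec clock cycle => {rate_mhz:6.0f} MHz clock rate")
```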

Myth: Higher clock = faster CPU

Clock Cycles per Instruction

Not all instructions take the same amount of time to execute
One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

$$\text{\# CPU clock cycles for a program} = \text{\# Instructions for a program} \times \text{Average clock cycles per instruction}$$

Clock Cycles per Instruction

Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
A way to compare two different implementations of the same ISA

CPI for this instruction class
Class   A   B   C
CPI     1   2   3

Effective CPI

Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

$$\text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$$

Where IC_i is the fraction (percentage) of executed instructions that belong to class i, so the sum is a weighted average
CPI_i is the (average) number of clock cycles per instruction for that instruction class
n is the number of instruction classes
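
The weighted sum can be sketched as follows, using the instruction classes A, B and C with CPIs of 1, 2 and 3 from the earlier CPI slide. The instruction mix is hypothetical, and the function name is an assumption made for this example.

```python
def effective_cpi(cpi_by_class: dict, fraction_by_class: dict) -> float:
    """Weighted average of per-class CPI, weighted by each class's share of executed instructions."""
    total = sum(fraction_by_class.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(cpi_by_class[c] * fraction_by_class[c] for c in fraction_by_class)


cpi_by_class = {"A": 1, "B": 2, "C": 3}   # CPI per instruction class (from the slide)
mix = {"A": 0.5, "B": 0.3, "C": 0.2}      # hypothetical instruction mix

print(f"Overall effective CPI = {effective_cpi(cpi_by_class, mix):.2f}")
```

With this mix the result is 0.5 x 1 + 0.3 x 2 + 0.2 x 3 = 1.7. A different mix over the same classes gives a different effective CPI.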
Effective CPI

The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs

The Performance Equation

Our basic performance equation is then

$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$

or

$$\text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}}$$
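
A small sketch of the performance equation applied to two implementations of the same ISA running the same program, so the instruction count is shared. All numbers are hypothetical. They are picked so that the machine with the lower clock rate still finishes first, in line with the earlier "Higher clock = faster CPU" myth slide.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instruction_count x CPI / clock_rate."""
    return instruction_count * cpi / clock_rate_hz


instruction_count = 1e9  # hypothetical: same program, same ISA on both machines

time_a = cpu_time(instruction_count, cpi=2.0, clock_rate_hz=4e9)  # higher clock, higher CPI
time_b = cpu_time(instruction_count, cpi=1.2, clock_rate_hz=3e9)  # lower clock, lower CPI

print(f"Machine A: {time_a:.3f} s")
print(f"Machine B: {time_b:.3f} s")
print(f"B is {time_a / time_b:.2f} times faster than A")
```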

The Performance Equation

These equations separate the three key factors that affect performance
Can measure the CPU execution time by running the program
The clock rate is usually given
Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
CPI varies by instruction type and ISA implementation, for which we must know the implementation details

Determinants of CPU Performance

$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$

                         Instruction_count   CPI   clock_cycle
Algorithm
Programming language
Compiler
ISA
Processor organization
Technology
Determinants of CPU Performance

$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$

                         Instruction_count   CPI   clock_cycle
Algorithm                X                   X
Programming language     X                   X
Compiler                 X                   X
ISA                      X                   X     X
Processor organization                       X     X
Technology                                         X
A Simple Example
Op       Freq   CPIi   Freq x CPIi
ALU      50%    1
Load     20%    5
Store    10%    3
Branch   20%    2

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

How does this compare with using branch prediction to shave a cycle off the branch time?

What if two ALU instructions could be executed at once?

A Simple Example
Op            Freq   CPIi   Freq x CPIi   Load in 2 cycles   Branch 1 cycle less   Two ALU at once
ALU           50%    1      .5            .5                 .5                    .25
Load          20%    5      1.0           .4                 1.0                   1.0
Store         10%    3      .3            .3                 .3                    .3
Branch        20%    2      .4            .4                 .2                    .4
Overall CPI                 2.2           1.6                2.0                   1.95
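
The three what-if questions can be checked with a short calculation. The per-class CPI values (ALU 1, Load 5, Store 3, Branch 2) are inferred by dividing each Freq x CPIi entry by its frequency, and the sketch assumes that instruction count and clock rate stay unchanged, so the speedup is just the ratio of overall CPIs.

```python
FREQ = {"ALU": 0.5, "Load": 0.2, "Store": 0.1, "Branch": 0.2}
BASE_CPI = {"ALU": 1, "Load": 5, "Store": 3, "Branch": 2}  # inferred from the Freq x CPIi column


def overall_cpi(cpi_by_op: dict) -> float:
    """Sum of frequency x CPI over all instruction classes."""
    return sum(FREQ[op] * cpi_by_op[op] for op in FREQ)


scenarios = {
    "baseline": BASE_CPI,
    "load takes 2 cycles": {**BASE_CPI, "Load": 2},
    "branch takes 1 cycle": {**BASE_CPI, "Branch": 1},
    "two ALU ops at once": {**BASE_CPI, "ALU": 0.5},
}

base = overall_cpi(BASE_CPI)
for name, cpi in scenarios.items():
    value = overall_cpi(cpi)
    # With instruction count and clock rate fixed, speedup = old CPI / new CPI.
    print(f"{name:22s} CPI = {value:.2f}, speedup = {base / value:.3f}")
```

Under these assumptions the better data cache gives the largest gain (2.2 / 1.6 = 1.375, or 37.5% faster), ahead of dual ALU issue (about 12.8%) and branch prediction (10%).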

Benchmarking

How not to compare apples with oranges?


Use a standard set of programs to test performance
Need more than a single program

A benchmark suite is a standard set of (usually portable) programs used for performance testing
Input specified
Expected output checked (with error margins defined)

Comparing and Summarizing Performance

How do we summarize the performance for a benchmark set with a single number?
The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

$$AM = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$

Where Time_i is the execution time for the ith program of a total of n programs in the workload
A smaller mean indicates a smaller average execution time and thus improved performance
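
A minimal sketch of summarizing a workload with the arithmetic mean. The two sets of execution times are invented for illustration and do not come from any real benchmark run.

```python
def arithmetic_mean(times_s: list) -> float:
    """AM = (1/n) * sum of the n execution times."""
    if not times_s:
        raise ValueError("need at least one execution time")
    return sum(times_s) / len(times_s)


# Hypothetical execution times (seconds) for the same three programs on two machines.
machine_a = [2.0, 6.0, 10.0]
machine_b = [3.0, 4.0, 8.0]

am_a = arithmetic_mean(machine_a)
am_b = arithmetic_mean(machine_b)
print(f"AM(A) = {am_a:.1f} s, total = {sum(machine_a):.1f} s")
print(f"AM(B) = {am_b:.1f} s, total = {sum(machine_b):.1f} s")
```

Here machine B has the smaller mean and also the smaller total, which illustrates why the AM tracks total execution time: it is the total divided by the fixed program count n.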

Comparing and Summarizing Performance

Guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

SPEC2000 Benchmarks (www.spec.org)

Integer benchmarks
gzip        compression
vpr         FPGA place & route
gcc         GNU C compiler
mcf         Combinatorial optimization
crafty      Chess program
parser      Word processing program
eon         Computer visualization
perlbmk     perl application
gap         Group theory interpreter
vortex      Object oriented database
bzip2       compression
twolf       Circuit place & route

FP benchmarks
wupwise     Quantum chromodynamics
swim        Shallow water model
mgrid       Multigrid solver in 3D fields
applu       Parabolic/elliptic pde
mesa        3D graphics library
galgel      Computational fluid dynamics
art         Image recognition (NN)
equake      Seismic wave propagation simulation
facerec     Facial image recognition
ammp        Computational chemistry
lucas       Primality testing
fma3d       Crash simulation fem
sixtrack    Nuclear physics accel
apsi        Pollutant distribution

SPEC2006 Benchmarks (www.spec.org)

Integer benchmarks
perlbench     perl application
bzip2         compression
gcc           GNU C compiler
mcf           Combinatorial optimization
gobmk         Go program
hmmer         Gene sequencing
sjeng         AI pattern recognition
libquantum    Quantum computing
h264ref       Video compression
omnetpp       Discrete event simulation
astar         Computer game
xalancbmk     XML processor

SPEC2006 Benchmarks (www.spec.org)

Floating point benchmarks
bwaves       Computational fluid dynamics
gamess       Quantum chemical computations
milc         Quantum Chromodynamics
zeusmp       Magnetohydrodynamics
gromacs      Molecular Dynamics
cactusADM    General relativity calculations
leslie3d     Computational fluid dynamics
namd         Molecular Dynamics Simulation
dealII       Solution of Partial Differential Equations
soplex       Simplex Linear Program (LP) Solver
povray       Computer Visualization
calculix     Structural Mechanics
GemsFDTD     Computational Electromagnetics
tonto        Quantum Crystallography
lbm          CFD Lattice Boltzmann Method
wrf          Weather Forecasting
sphinx3      Speech Recognition

SPEC_rate
After the benchmarks are run on the system under test (SUT), a ratio for each of them is
calculated using the run time on the SUT and a SPEC-determined reference time.
From these ratios, the following metrics are calculated:
CINT2006 (for integer compute intensive performance comparisons):
* SPECint2006: The geometric mean of twelve normalized ratios - one for each
integer benchmark - when the benchmarks are compiled with peak tuning.
* SPECint_base2006: The geometric mean of twelve normalized ratios when the
benchmarks are compiled with base tuning.
* SPECint_rate2006: The geometric mean of twelve normalized throughput ratios
when the benchmarks are compiled with peak tuning.
* SPECint_rate_base2006: The geometric mean of twelve normalized throughput
ratios when the benchmarks are compiled with base tuning.
Similar metrics are used for the FP benchmarks.
In all cases, a higher score means "better performance" on the given workload.
Notes

The geometric mean of a data set [a1, a2, ..., an] is given by

$$\left( \prod_{i=1}^{n} a_i \right)^{1/n} = \sqrt[n]{a_1 a_2 \cdots a_n}$$

SPEC uses a historical Sun system, the Ultra Enterprise 2, which was introduced in 1997, as the reference machine. The reference machine uses a 296 MHz UltraSPARC II processor.
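
A sketch of how a SPEC-style score could be formed from per-benchmark ratios. It assumes each ratio is reference time divided by run time on the system under test, which matches the statement that a higher score means better performance. The times are hypothetical, and a real suite would have one entry per benchmark.

```python
import math


def geometric_mean(values: list) -> float:
    """n-th root of the product of n positive values, computed via logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("need a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def spec_ratios(reference_times_s: list, sut_times_s: list) -> list:
    """Assumed ratio definition: reference time / run time on the system under test."""
    return [ref / sut for ref, sut in zip(reference_times_s, sut_times_s)]


# Hypothetical reference and measured run times (seconds) for three benchmarks.
reference = [1000.0, 2000.0, 4000.0]
measured = [100.0, 250.0, 1000.0]

ratios = spec_ratios(reference, measured)
print("ratios:", ratios)
print(f"score (geometric mean) = {geometric_mean(ratios):.2f}")
```

One consequence of using the geometric mean of ratios is that the ranking of two machines does not depend on which reference machine was chosen, because the reference times cancel when two scores are divided.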

Performance improvement

Other benchmarks (Windows world)

BAPCO's SYSmark 2004 SE. Based on real Windows applications including Office.
POV-Ray 3.70 beta 20.
3ds Max 8. 3D modeling and animation tool.
NewTek LightWave. Another 3D modeling and animation tool used
primarily for special effects.
Adobe After Effects.
Maxon's Cinebench, 3D content creation benchmark.
Windows Media Encoder.
Mainconcept H.264 and MPEG-2 Encoders.
3DMark06 and PCMark05. Synthetic benchmarks.
Games, such as Flight Simulator X, Supreme Commander, Prey,
and Company of Heroes.

The Challenges of Benchmarking

Products can be tuned for specific benchmarks
Measures only performance
Security, cost, reliability, serviceability, scalability etc. not measured
May not reflect the actual workload
A good read: http://donutey.com/hardwaretesting.php

Other Performance Metrics

Power consumption: especially in the embedded market where battery life is important (and passive cooling)
For power-limited applications, the most important metric is energy efficiency

The Challenge of Power

Source: Fred Pollack, Intel Corp.


Intel's chip

The reality of cooling


Cooling tower

Desktop PC
Motherboard
Summary: Evaluating ISAs

Design-time metrics:
Can it be implemented, in how long, at what cost?
Can it be programmed? Ease of compilation?

Static Metrics:
How many bytes does the program occupy in memory?

Dynamic Metrics:
How many instructions are executed? How many bytes does the processor fetch to execute the program?
How many clocks are required per instruction?
How "lean" a clock is practical?

Best Metric: Time to execute the program!
depends on the instruction set, the processor organization, and compilation techniques.

(figure labels: Inst. Count, CPI, Cycle Time)

Multicore processors

Both Intel and AMD already sell 2-4 core processors
A single chip contains 2-4 CPUs

These can be used as throughput machines
For a single application, to take advantage of multicore must go for parallel processing

Intel Core i7 (Sandy Bridge)

Intel Core i7 (Sandy Bridge)

Parallel programming matters!

QX6850: 4 cores
E6850: 2 cores

END
