Computer Architecture
ECGR 4181/5181
UNCC 2015
Computer Technology
Performance improvements:
Improvements in semiconductor technology
- Feature size, clock speed
- Transistor density increases by 35% per year and die size
increases by 10-20% per year, allowing more functionality
- Transistor speed improves linearly with size (complex
equation involving voltages, resistances, capacitances),
which can lead to clock speed improvements!
- Wire delays do not scale down at the same rate as logic
delays
- The power wall: it is not possible to consistently run at higher
frequencies without hitting power/thermal limits (Turbo Mode
can cause occasional frequency boosts)
- In a clock cycle, can do more work -- since transistors are
faster, transistors are more energy-efficient, and there are
more of them
Computer Technology
Performance improvements:
Improvements in computer architectures
- Enabled by HLL compilers, UNIX
- Led to RISC architectures
Together have enabled:
- Lightweight computers
- Productivity-based managed/interpreted programming
languages
RISC
Classes of Computers
Personal Mobile Device (PMD)
e.g. smart phones, tablet computers
Emphasis on cost, energy efficiency, media performance, and real-time performance
Desktop Computing
Emphasis on price-performance, energy, graphics
Servers
Emphasis on availability, scalability, throughput, energy
Embedded Computers
Emphasis: price
                  Desktop                Server                                  Embedded
System price      $1000 - $10,000        $10,000 - $10,000,000                   $10 - $100,000
Processor price   $100 - $1000           $200 - $2000                            $0.20 - $200
Design issues     Price-perf, graphics   Throughput, availability, scalability
Parallelism
Classes of parallelism in applications:
Data-Level Parallelism (DLP)
Task-Level Parallelism (TLP)
Flynn's Taxonomy
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data streams (SIMD)
Vector architectures
Multimedia extensions
Graphics processor units
Trends in Technology
Integrated circuit technology
Transistor density: 35%/year
Die size: 10-20%/year
Integration overall: 40-55%/year
Trends in Cost
Cost driven down by learning curve
Yield
Bose-Einstein formula:
Measuring Performance
Typical performance metrics:
Response time
Throughput
Speedup of X relative to Y = $\text{Execution time}_Y / \text{Execution time}_X$
Execution time
Wall clock time: includes all system overheads
CPU time: only computation time
Benchmarks
SPEC CPU 2006: cpu-oriented programs (for desktops)
SPECweb, TPC: throughput-oriented (for servers)
EEMBC: for embedded processors/workloads
Principle of Locality
Reuse of data and instructions
$\text{Speedup}(E) = \dfrac{\text{ExTime}_{\text{before}}}{\text{ExTime}_{\text{after}}} = \dfrac{1}{(1-F) + \dfrac{F}{S}}$
Measuring Performance
Performance Metrics
Purchasing perspective
given a collection of machines, which has the
best performance?
least cost?
best cost/performance?
Design perspective
faced with design options, which has the
best performance improvement?
least cost?
best cost/performance?
Both require
basis for comparison
metric for evaluation
$\text{performance}_X = 1 / \text{execution\_time}_X$
If X is n times faster than Y, then
$\dfrac{\text{performance}_X}{\text{performance}_Y} = \dfrac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n$
Throughput: the total amount of work done in a given time
Important to data center managers
Performance Factors
Want to distinguish elapsed time and the time spent on
our task
CPU execution time (CPU time): the time the CPU spends
working on a task
Does not include time waiting for I/O or running other programs
CPI
Effective CPI
Computing the overall effective CPI is done by looking at
the different types of instructions and their individual
cycle counts and averaging
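$\text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$
Here $\text{IC}_i$ is the fraction of executed instructions that belong to class i, and n is the number of instruction classes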
$\text{CPU time} = \dfrac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}}$
Table: factors (algorithm, programming language, compiler, ISA, processor organization, technology) against the components of CPU time they affect (instruction count, CPI, clock cycle)
A Simple Example
Op       Freq   CPI_i   Freq x CPI_i   with Load CPI = 2   with Branch CPI = 1   with ALU CPI = 0.5
ALU      50%    1       .5             .5                  .5                    .25
Load     20%    5       1.0            .4                  1.0                   1.0
Store    10%    3       .3             .3                  .3                    .3
Branch   20%    2       .4             .4                  .2                    .4
Total                   2.2            1.6                 2.0                   1.95
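The totals in this example can be reproduced as a frequency-weighted sum of per-class CPIs. The following is a minimal Python sketch, not part of the original slides. The per-class CPI values (1, 5, 3, 2) and the three alternatives (load CPI of 2, branch CPI of 1, ALU CPI of 0.5) are assumptions inferred from the Freq x CPIi products, and the names `FREQ`, `BASE_CPI`, and `effective_cpi` are illustrative.

```python
# Instruction mix from the example (fraction of executed instructions).
FREQ = {"ALU": 0.50, "Load": 0.20, "Store": 0.10, "Branch": 0.20}

# Assumption: per-class CPIs inferred from the Freq x CPIi products.
BASE_CPI = {"ALU": 1.0, "Load": 5.0, "Store": 3.0, "Branch": 2.0}


def effective_cpi(freq, cpi):
    """Return the sum over instruction classes of freq_i * CPI_i."""
    return sum(freq[op] * cpi[op] for op in freq)


base = effective_cpi(FREQ, BASE_CPI)
faster_load = effective_cpi(FREQ, {**BASE_CPI, "Load": 2.0})
faster_branch = effective_cpi(FREQ, {**BASE_CPI, "Branch": 1.0})
faster_alu = effective_cpi(FREQ, {**BASE_CPI, "ALU": 0.5})

for name, cpi in [("base", base), ("load CPI 2", faster_load),
                  ("branch CPI 1", faster_branch), ("ALU CPI 0.5", faster_alu)]:
    print(f"{name:>12}: effective CPI = {cpi:.2f}")
```

Under these assumptions, dividing the base CPI by an alternative's CPI gives that alternative's speedup, since instruction count and clock rate are unchanged.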
                 m/c A   m/c B   m/c C
Prog 1 (secs)    1       10      20
Prog 2 (secs)    1000    100     20
$\text{AM} = \dfrac{1}{n} \sum_{i=1}^{n} \text{Time}_i$
where $\text{Time}_i$ is the execution time for the i-th program of a total of
n programs in the workload
A smaller mean indicates a smaller average execution time and
thus improved performance
A second approach to an unequal mixture of programs in the workload: normalize execution times
to a reference m/c and then take the average of the normalized execution times.
Geometric mean
$\text{GM} = \sqrt[n]{\prod_{i=1}^{n} \text{execution time ratio}_i}$
where $\text{execution time ratio}_i$ is the execution time for program i normalized to the reference m/c
Interesting property
$\dfrac{\text{Geometric mean}(X_i)}{\text{Geometric mean}(Y_i)} = \text{Geometric mean}\left(\dfrac{X_i}{Y_i}\right)$
Geometric means of normalized execution times are consistent: it does not matter which m/c is
the reference
This is not true for the arithmetic mean, which should not be used to average normalized execution times
Example: Comparative performance evaluation where the programs are fixed but the inputs are not
Competitors could calculate weighted arithmetic mean by using their best performing
benchmark for the longest input.
Geometric mean would be less misleading than arithmetic mean for such situations
Execution times (secs)
                 m/c A   m/c B   m/c C
Prog 1 (secs)    1       10      20
Prog 2 (secs)    1000    100     20

Execution times normalized to each reference m/c
         Normalized to A         Normalized to B         Normalized to C
         A      B      C         A      B      C         A      B      C
P1       1.0    10.0   20.0      0.1    1.0    2.0       0.05   0.5    1.0
P2       1.0    0.1    0.02      10.0   1.0    0.2       50.0   5.0    1.0
AM       1.0    5.05   10.01     5.05   1.0    1.1       25.03  2.75   1.0
GM       1.0    1.00   0.63      1.0    1.0    0.63      1.58   1.58   1.0
Total    1.0    0.11   0.04      9.1    1.0    0.36      25.03  2.75   1.0
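The AM and GM rows can be recomputed from the raw execution times. The following is a minimal Python sketch, not part of the original slides. It assumes that Prog 1 takes 1 second on m/c A (a value inferred from the normalized P1 row), and the helper names `normalized` and `means` are illustrative.

```python
from math import prod

# Execution times in seconds for (Prog 1, Prog 2) on each machine.
# Assumption: Prog 1 on m/c A takes 1 s, inferred from the normalized rows.
TIMES = {"A": (1, 1000), "B": (10, 100), "C": (20, 20)}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return prod(values) ** (1 / len(values))


def normalized(machine, reference):
    """Per-program times of `machine` divided by those of `reference`."""
    return [t / r for t, r in zip(TIMES[machine], TIMES[reference])]


def means(reference):
    """Map each machine to (AM, GM) of its times normalized to `reference`."""
    return {m: (arithmetic_mean(normalized(m, reference)),
                geometric_mean(normalized(m, reference)))
            for m in TIMES}


for ref in TIMES:
    row = ", ".join(f"{m}: AM={am:.2f} GM={gm:.2f}"
                    for m, (am, gm) in means(ref).items())
    print(f"normalized to {ref} -> {row}")
```

In this sketch the ratio of two machines' geometric means is the same for every reference, while the arithmetic mean ranks A best when normalized to A and worst when normalized to B.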
Conclusion: In general there is no workload for three or more machines that will
match the performance predicted by the geometric means of normalized execution
times.
Another problem with GM: It encourages h/w and s/w designers to focus their
attention on the benchmarks where performance is easiest to improve rather than on
the benchmarks that are slowest.
Example: cutting running time from 2 seconds to 1 improves the GM exactly as much as cutting it from 10,000 seconds to 5,000
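A short Python sketch, offered as an illustration rather than as part of the original slides, makes the 2-second versus 10,000-second example concrete. It assumes a hypothetical two-benchmark suite containing exactly those two programs.

```python
from math import prod


def geometric_mean(values):
    return prod(values) ** (1 / len(values))


# Hypothetical two-benchmark suite: running times in seconds.
baseline = [2, 10_000]
short_halved = [1, 10_000]   # halve the 2 s benchmark
long_halved = [2, 5_000]     # halve the 10,000 s benchmark

gm_ratio_short = geometric_mean(short_halved) / geometric_mean(baseline)
gm_ratio_long = geometric_mean(long_halved) / geometric_mean(baseline)

saved_short = sum(baseline) - sum(short_halved)   # seconds actually saved
saved_long = sum(baseline) - sum(long_halved)

print(f"GM ratio, short benchmark halved: {gm_ratio_short:.4f}, saves {saved_short} s")
print(f"GM ratio, long benchmark halved:  {gm_ratio_long:.4f}, saves {saved_long} s")
```

Both changes scale the geometric mean by the same factor, although the second saves far more running time.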
Example
A new laptop has an IPC that is 20% worse than the old
laptop. It has a clock speed that is 30% higher than the old
laptop. The same binaries on both machines.
What is the speedup of the new laptop?
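One way to work this example, sketched in Python under stated assumptions: the instruction count is the same on both laptops (same binaries), "20% worse" is read as the new IPC being 0.8 times the old IPC, and "30% higher" as the new clock rate being 1.3 times the old one. The function name `speedup` is illustrative.

```python
def speedup(ipc_ratio, clock_ratio):
    """Speedup for a fixed instruction count.

    Execution time = IC / (IPC * clock rate), so with IC unchanged
    time_old / time_new = (IPC_new / IPC_old) * (clock_new / clock_old).
    """
    return ipc_ratio * clock_ratio


# Assumption: IPC_new = 0.8 * IPC_old and clock_new = 1.3 * clock_old.
new_vs_old = speedup(ipc_ratio=0.8, clock_ratio=1.3)
print(f"speedup of the new laptop = {new_vs_old:.2f}")
```

Under this reading the new laptop is about 4% faster, because the clock gain slightly outweighs the IPC loss.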
Example
A new laptop has an IPC that is 20% worse than the old
laptop. It has a clock speed that is 30% higher than the old
laptop. The same binaries run on both machines. Their IPCs are
listed below. The binaries are run such that each binary
gets an equal share of CPU time.
What is the speedup of the new laptop?
          P1    P2    P3
Old-IPC   1.2   1.6   2.0
New-IPC   1.6   1.6   1.6
Example
A new laptop has an IPC that is 20% worse than the old
laptop. It has a clock speed that is 30% higher than the old
laptop. The same binaries run on both machines. Their IPCs are
listed below. The binaries are run such that each binary
gets an equal share of CPU time.
What is the speedup of the new laptop?
          P1    P2    P3    AM    GM
Old-IPC   1.2   1.6   2.0   1.6   1.57
New-IPC   1.6   1.6   1.6   1.6   1.6
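The AM and GM columns, and the speedup each one implies, can be checked with a few lines of Python. This is a sketch under one reading of the example, not a definitive answer: with equal CPU-time shares, the instructions completed per second scale with clock rate times the arithmetic mean of the IPCs, and the new clock is taken to be 1.3 times the old one. The names below are illustrative.

```python
from math import prod

OLD_IPC = [1.2, 1.6, 2.0]
NEW_IPC = [1.6, 1.6, 1.6]
CLOCK_RATIO = 1.3  # assumption: new clock rate / old clock rate


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return prod(values) ** (1 / len(values))


am_old, am_new = arithmetic_mean(OLD_IPC), arithmetic_mean(NEW_IPC)
gm_old, gm_new = geometric_mean(OLD_IPC), geometric_mean(NEW_IPC)

# Equal time shares: throughput is proportional to clock rate * AM(IPC).
speedup_am = CLOCK_RATIO * am_new / am_old
# For comparison only: the same ratio computed from geometric means.
speedup_gm = CLOCK_RATIO * gm_new / gm_old

print(f"AM old={am_old:.2f} new={am_new:.2f} -> speedup {speedup_am:.2f}")
print(f"GM old={gm_old:.2f} new={gm_new:.2f} -> speedup {speedup_gm:.2f}")
```

The two averages give slightly different speedups, which is why the choice of mean has to match how the workload shares the machine.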
One of the most important principles: make the common case fast. Favor the frequent case over the
infrequent case when making a design trade-off.
How much performance can be improved by making the frequent case faster?
Amdahl's Law: The performance improvement to be gained from using some faster mode of execution
is limited by the fraction of the time the faster mode can be used. It defines the speedup that can be
gained by using a particular feature.
$\text{speedup} = \dfrac{\text{execution time for the entire task without using the enhancement}}{\text{execution time for the entire task using the enhancement}}$
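$\text{fraction}_{\text{enhanced}} \le 1$: the fraction of the original execution time that can use the enhancement
$\text{speedup}_{\text{enhanced}}$: the improvement gained by the enhanced execution mode, i.e. how much faster the task
would run if the enhanced mode were used for the entire program
$\text{speedup}_{\text{overall}} = \dfrac{\text{execution time}_{\text{old}}}{\text{execution time}_{\text{new}}} = \dfrac{1}{(1 - \text{fraction}_{\text{enhanced}}) + \dfrac{\text{fraction}_{\text{enhanced}}}{\text{speedup}_{\text{enhanced}}}}$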
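$\text{speedup}_{\text{overall}} = \dfrac{1}{(1 - \text{fraction}_{\text{enhanced}}) + \dfrac{\text{fraction}_{\text{enhanced}}}{\text{speedup}_{\text{enhanced}}}} = \dfrac{1}{0.6 + \dfrac{0.4}{10}} = 1.56$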
$\text{speedup}_{\text{FPSQR}} = \dfrac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = 1.22$
$\text{speedup}_{\text{FP}} = \dfrac{1}{(1 - 0.5) + \dfrac{0.5}{1.6}} = 1.23$
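The three Amdahl's Law results in this section (1.56, 1.22, and 1.23) can be checked with a small helper. The following is a minimal Python sketch; the function name `amdahl_speedup` and its argument checks are illustrative and not part of the original slides.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - f) + f / s) per Amdahl's Law."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


cases = {
    "40% of time, 10x faster": (0.4, 10.0),
    "FPSQR: 20% of time, 10x faster": (0.2, 10.0),
    "FP: 50% of time, 1.6x faster": (0.5, 1.6),
}
for label, (fraction, factor) in cases.items():
    print(f"{label}: overall speedup = {amdahl_speedup(fraction, factor):.2f}")
```

However large `speedup_enhanced` becomes, the result never exceeds 1 / (1 - fraction_enhanced), which is the limit the law expresses.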
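Alternative: Decompose CPU execution time into different components and see
how different alternatives affect these components
$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{clock cycle time}$
$\text{CPU time} = \dfrac{\text{CPU clock cycles for a program}}{\text{clock rate}}$
$\text{CPI} = \dfrac{\text{CPU clock cycles for a program}}{\text{IC}}$
$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{clock cycle time}$
$\dfrac{\text{instructions}}{\text{program}} \times \dfrac{\text{clock cycles}}{\text{instruction}} \times \dfrac{\text{seconds}}{\text{clock cycle}} = \dfrac{\text{seconds}}{\text{program}} = \text{CPU time}$
CPU performance is dependent on three characteristics: clock cycle (or rate), clock cycles per
instruction, and instruction count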
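$\text{CPI} = \dfrac{\sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i}{\text{IC}} = \sum_{i=1}^{n} \dfrac{\text{IC}_i}{\text{IC}} \times \text{CPI}_i$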
Note: CPI i should be measured and not just calculated from the reference table
Example
Suppose we have the following measurements:
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the
average CPI of all the FP operations to 2.5. Compare the two design alternatives using the
CPU performance equation.
Note: only the CPI changes; the clock rate and IC do not change
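$\text{CPI}_{\text{original}} = \sum_{i=1}^{n} \text{CPI}_i \times \dfrac{\text{IC}_i}{\text{IC}}$
$\text{CPI}_{\text{new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times (\text{CPI}_{\text{old FPSQR}} - \text{CPI}_{\text{new FPSQR only}}) = 2.0 - 2\% \times (20 - 2) = 1.64$
CPI for enhanced FP instructions
$\text{speedup}_{\text{new FP}} = \dfrac{\text{CPU time}_{\text{original}}}{\text{CPU time}_{\text{new FP}}} = \dfrac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} = \dfrac{2.0}{1.62} = 1.23$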
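The comparison of the two design alternatives can be sketched in Python. The FPSQR figures (2% frequency, CPI of 20 reduced to 2) and the original CPI of 2.0 come from the calculation above. The FP figures (25% frequency, original average FP CPI of 4.0) do not survive in this text and are assumptions, chosen because they reproduce the 1.62 and 1.23 shown. The helper name `new_cpi` is illustrative.

```python
CPI_ORIGINAL = 2.0

# From the example: FPSQR is 2% of instructions, CPI drops from 20 to 2.
FPSQR_FREQ, FPSQR_CPI_OLD, FPSQR_CPI_NEW = 0.02, 20.0, 2.0

# Assumption: FP ops are 25% of instructions with an average CPI of 4.0,
# reduced to 2.5 by the second alternative.
FP_FREQ, FP_CPI_OLD, FP_CPI_NEW = 0.25, 4.0, 2.5


def new_cpi(cpi_original, freq, cpi_old, cpi_new):
    """Overall CPI after one instruction class gets a lower CPI."""
    return cpi_original - freq * (cpi_old - cpi_new)


cpi_fpsqr = new_cpi(CPI_ORIGINAL, FPSQR_FREQ, FPSQR_CPI_OLD, FPSQR_CPI_NEW)
cpi_fp = new_cpi(CPI_ORIGINAL, FP_FREQ, FP_CPI_OLD, FP_CPI_NEW)

# IC and clock rate are unchanged, so speedup is the ratio of CPIs.
speedup_fpsqr = CPI_ORIGINAL / cpi_fpsqr
speedup_fp = CPI_ORIGINAL / cpi_fp

print(f"FPSQR alternative: CPI = {cpi_fpsqr:.3f}, speedup = {speedup_fpsqr:.2f}")
print(f"FP alternative:    CPI = {cpi_fp:.3f}, speedup = {speedup_fp:.2f}")
```

Under these assumptions the FP alternative comes out marginally ahead, in line with the Amdahl's Law comparison of the same two options.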
Speedup = (IC x old CPI x old clock cycle time) / (IC x new CPI x new clock cycle time) < 1
NO!
Calculate CPI
Break down CPI into compute, memory stalls, etc. to identify performance problems
Cycle-level simulators
Harmonic Mean
If performance is expressed as a rate, then the average that tracks the total execution
time is the harmonic mean:
$\text{HM} = \dfrac{n}{\sum_{i=1}^{n} \dfrac{1}{\text{Rate}_i}}$
Example: Travel at 40 MPH for one mile and 80 MPH for one mile; the average speed is not 60 MPH but 2 / (1/40 + 1/80) = 53.3 MPH
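The travel example can be checked directly in Python. This sketch uses `statistics.harmonic_mean` from the standard library and compares it with total distance divided by total time; the variable names are illustrative.

```python
from statistics import harmonic_mean

# One mile at each speed, in miles per hour.
speeds_mph = [40, 80]

hm = harmonic_mean(speeds_mph)

# Cross-check: total distance / total time for one mile per leg.
total_distance_miles = len(speeds_mph)
total_time_hours = sum(1 / s for s in speeds_mph)
average_speed = total_distance_miles / total_time_hours

print(f"harmonic mean  = {hm:.2f} MPH")
print(f"distance/time  = {average_speed:.2f} MPH")
print(f"arithmetic mean = {sum(speeds_mph) / len(speeds_mph):.2f} MPH")
```

The harmonic mean matches distance over time because each leg covers the same distance, which is the rate analogue of each benchmark doing the same amount of work.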
Little's Law
The average number of objects in a queue is the product of the entry rate and the average
holding time
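As a small illustration of this product, the Python sketch below uses hypothetical numbers (100 requests per second, each held for 0.05 seconds on average). The function name `average_occupancy` is illustrative.

```python
def average_occupancy(entry_rate, holding_time):
    """Little's Law: average number in the queue = entry rate * holding time."""
    return entry_rate * holding_time


# Hypothetical numbers: 100 requests/s arrive, each held 0.05 s on average.
in_flight = average_occupancy(entry_rate=100.0, holding_time=0.05)
print(f"average requests in flight: {in_flight:.1f}")
```

The same relation can be rearranged to size a buffer: a target latency and arrival rate fix the number of entries that must be held on average.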
$\text{Dynamic power} = \frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$
Power
Intel 80386 consumed ~2 W
3.3 GHz Intel Core i7 consumes 130 W
Heat must be dissipated from a 1.5 x 1.5 cm chip
This is the limit of what can be cooled by air
Reducing Power
Techniques for reducing power:
Do nothing well
Dynamic Voltage-Frequency Scaling
Low power state for DRAM, disks
Overclocking, turning off cores
Static Power
Static power consumption = $\text{Current}_{\text{static}} \times \text{Voltage}$
Scales with number of transistors
To reduce: power gating
Power Trends
In CMOS IC technology
Voltage: 5V → 1V
Frequency: x1000
Reducing Power
Suppose a new CPU has
85% of capacitive load of old CPU
15% voltage and 15% frequency reduction
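The effect on dynamic power can be worked out from the dynamic power relation (power proportional to capacitive load x voltage squared x frequency). The following Python sketch normalizes the old CPU to 1.0 in every term; any constant factor cancels in the ratio, and the function name `dynamic_power` is illustrative.

```python
def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic power = 1/2 * capacitive load * voltage^2 * frequency switched."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


# Old CPU normalized to 1.0 for each term.
old_power = dynamic_power(1.0, 1.0, 1.0)

# New CPU: 85% of the capacitive load, 15% lower voltage and frequency.
new_power = dynamic_power(0.85, 0.85, 0.85)

ratio = new_power / old_power
print(f"new / old dynamic power = {ratio:.3f}")
```

The ratio is 0.85 to the fourth power, about 0.52, so the new CPU draws roughly half the dynamic power of the old one.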
Dependability
Module reliability
Mean time to failure (MTTF)
Mean time to repair (MTTR)
Mean time between failures (MTBF) = MTTF + MTTR
Availability = MTTF / MTBF
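These definitions translate directly into code. The Python sketch below uses hypothetical figures (an MTTF of 1,000,000 hours and an MTTR of 24 hours) purely for illustration; the function name `availability` is not from the slides.

```python
def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / MTBF, where MTBF = MTTF + MTTR."""
    mtbf_hours = mttf_hours + mttr_hours
    return mttf_hours / mtbf_hours


# Hypothetical module: fails once per 1,000,000 h, takes 24 h to repair.
module_availability = availability(mttf_hours=1_000_000, mttr_hours=24)
expected_downtime_hours_per_year = (1 - module_availability) * 24 * 365

print(f"availability = {module_availability:.6f}")
print(f"expected downtime = {expected_downtime_hours_per_year:.2f} h/year")
```

Shortening MTTR raises availability even when MTTF is unchanged, which is why repair time matters as much as failure rate.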
FailureRate =
MTTF =
Performance Improvement
Principle of Locality: programs tend to reuse data and instructions they have used
recently
Rule of thumb: A program spends 90% of its execution time in only 10% of the code. We
can predict with reasonable accuracy what instructions and data a program will use in the
near future based on its accesses in the recent past.
The principle of locality also applies to data accesses, though not as strongly as to code accesses.
Two types of locality: temporal locality (recently accessed items are likely to be accessed again soon) and spatial locality (items with nearby addresses tend to be referenced close together in time)
Performance Improvement
Multiprocessors
Multicore microprocessors
More than one processor per chip
Fallacy: The relative performance of two processors with the same ISA can be judged by clock
rate or by the performance of a single benchmark suite.
As processors have become faster and more sophisticated, processor performance in one
application area can diverge from that in another area. Sometimes the ISA is responsible,
but more often the pipeline structure and memory system are responsible. Thus the clock
rate is not a good metric.
Performance of a 1.7 GHz Pentium 4 relative to a 1 GHz Pentium III
Machine                    EEMBC benchmark set   Compiler-generated performance   Hand-coded performance   Ratio hand/compiler
Trimedia 1300 @ 166 MHz    Consumer              23.3                             110                      4.7
BOPS Manta @ 136 MHz       Telecomm              2.6                              225.8                    86.8
TI TMS320C6203 @ 300 MHz   Telecomm              6.8                              68.5                     10.1
Fallacy: The best design for a computer is the one that optimizes the primary objective without
considering implementation
Complex designs take longer to complete and prolong time to market, so the design will be less
competitive
Pitfall: Neglecting the cost of software in either evaluating a system or examining cost-performance
For many years, hardware was so expensive that it dominated the cost of software, but this
is no longer true
For a medium-size database server the software costs are roughly 50% of the total cost, while
for a desktop the software cost is somewhere between 23% and 38%
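$\text{MIPS} = \dfrac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \dfrac{\text{clock rate}}{\text{CPI} \times 10^6}$
$\text{Execution time} = \dfrac{\text{IC}}{\text{MIPS} \times 10^6}$
Easy to understand
Faster machines have a higher MIPS rating, which matches intuition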
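The MIPS relations can be exercised with a short Python sketch. The machine parameters below (2 GHz clock, CPI of 2.0, 5 billion instructions) are hypothetical values chosen for round numbers, and the function names are illustrative.

```python
def mips(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def execution_time(instruction_count, mips_rating):
    """Execution time in seconds = IC / (MIPS * 10^6)."""
    return instruction_count / (mips_rating * 1e6)


# Hypothetical machine and program.
CLOCK_RATE_HZ = 2e9
CPI = 2.0
INSTRUCTION_COUNT = 5e9

rating = mips(CLOCK_RATE_HZ, CPI)
seconds = execution_time(INSTRUCTION_COUNT, rating)

# Cross-check against CPU time = IC * CPI / clock rate.
seconds_check = INSTRUCTION_COUNT * CPI / CLOCK_RATE_HZ

print(f"MIPS rating = {rating:.0f}")
print(f"execution time = {seconds:.1f} s (check: {seconds_check:.1f} s)")
```

Because the rating depends on CPI, which itself depends on the instruction mix, the same machine can report different MIPS values for different programs.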
$\text{MIPS}_M = \dfrac{\text{Performance}_M}{\text{Performance}_{\text{reference}}} \times \text{MIPS}_{\text{reference}}$
Design-time metrics:
Can it be implemented, in how long, at what cost?
Can it be programmed? Ease of compilation?
Static Metrics:
How many bytes does the program occupy in memory?
Dynamic Metrics:
How many instructions are executed? (Inst. Count) How many bytes does the
processor fetch to execute the program?
How many clocks are required per instruction? (CPI)
How "lean" a clock is practical? (Cycle Time)