Computer Architecture

This document provides an introduction to computer architecture and performance. It discusses several topics: 1) Performance improvements in semiconductor technology including shrinking transistor size, increasing transistor density and speed, and limitations in wire scaling. 2) Improvements in computer architectures enabled by high-level languages, compilers, and RISC architectures that have allowed for lightweight and managed programming languages. 3) Current trends requiring explicit restructuring of applications to leverage data, thread, and request-level parallelism as instruction-level parallelism improvements have slowed.

Uploaded by

Balaji Mc


Introduction

Background: ECGR 3183 or equivalent, based on
Hennessy and Patterson's Computer Organization and Design

ECGR 4181/5181

UNCC 2015

Computer Technology
Performance improvements:
Improvements in semiconductor technology
- Feature size, clock speed
- Transistor density increases by 35% per year and die size
increases by 10-20% per year, allowing more functionality
- Transistor speed improves linearly as feature size shrinks (a complex
equation involving voltages, resistances, and capacitances), which
can lead to clock speed improvements
- Wire delays do not scale down at the same rate as logic
delays
- The power wall: it is not possible to consistently run at higher
frequencies without hitting power/thermal limits (Turbo Mode
can cause occasional frequency boosts)
- In a clock cycle, can do more work -- since transistors are
faster, transistors are more energy-efficient, and there are
more of them

Computer Technology
Performance improvements:
Improvements in computer architectures
- Enabled by HLL compilers, UNIX
- Led to RISC architectures
Together have enabled:
- Lightweight computers
- Productivity-based managed/interpreted programming
languages


Single Processor Performance

[Figure: single-processor performance growth; annotations: RISC, move to multi-processor]

Current Trends in Architecture

Cannot continue to leverage Instruction-Level
parallelism (ILP)
Single processor performance improvement ended in
2003

New models for performance:

Data-level parallelism (DLP)
Thread-level parallelism (TLP)
Request-level parallelism (RLP)

These require explicit restructuring of the application

Classes of Computers
Personal Mobile Device (PMD)
e.g. smart phones, tablet computers
Emphasis on cost, energy efficiency, media performance, and real-time

Desktop Computing
Emphasis on price-performance, energy, graphics

Servers
Emphasis on availability, scalability, throughput, energy

Clusters / Warehouse Scale Computers

Used for Software as a Service (SaaS)
Emphasis on availability, price-performance, energy proportionality
Sub-class: Supercomputers, emphasis: floating-point performance and
fast internal networks

Embedded Computers
Emphasis: price

Embedded Processor Characteristics

The largest class of computers spanning the widest range
of applications and performance

Often have minimum performance requirements. Example?

Often have stringent limitations on cost. Example?

Often have stringent limitations on power consumption.
Example?

Often have low tolerance for failure. Example?


Characteristics of Each Segment

Desktop: Need high performance (graphics performance?)
and low price
Servers: Need high throughput, availability (revenue
losses of $14K-$6.5M per hour of downtime), scalability
Embedded: Need low power, low cost, low memory (almost
don't care about performance, except for real-time constraints)

Feature           Desktop                Server                                  Embedded
System price      $1000 - $10,000        $10,000 - $10,000,000                   $10 - $100,000
Processor price   $100 - $1000           $200 - $2000                            $0.20 - $200
Design issues     Price-perf, graphics   Throughput, availability, scalability   Price, power, app-specific performance

Parallelism
Classes of parallelism in applications:
Data-Level Parallelism (DLP)
Task-Level Parallelism (TLP)

Classes of architectural parallelism:

Instruction-Level Parallelism (ILP)
Vector architectures/Graphic Processor Units (GPUs)
Thread-Level Parallelism
Request-Level Parallelism


Flynn's Taxonomy
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data streams (SIMD)
Vector architectures
Multimedia extensions
Graphics processor units

Multiple instruction streams, single data stream (MISD)

No commercial implementation

Multiple instruction streams, multiple data streams (MIMD)
Tightly-coupled MIMD
Loosely-coupled MIMD

Defining Computer Architecture

Old view of computer architecture:
Instruction Set Architecture (ISA) design
i.e. decisions regarding:
- registers, memory addressing, addressing modes,
instruction operands, available operations, control flow
instructions, instruction encoding

Real computer architecture:

Specific requirements of the target machine
Design to maximize performance within constraints:
cost, power, and availability
Includes ISA, microarchitecture, hardware

Trends in Technology
Integrated circuit technology
Transistor density: 35%/year
Die size: 10-20%/year
Integration overall: 40-55%/year

DRAM capacity: 25-40%/year (slowing)

Flash capacity: 50-60%/year
15-20X cheaper/bit than DRAM

Magnetic disk technology: 40%/year

15-25X cheaper/bit than Flash
300-500X cheaper/bit than DRAM


Bandwidth and Latency

Bandwidth or throughput
Total work done in a given time
10,000-25,000X improvement for processors
300-1200X improvement for memory and disks

Latency or response time

Time between start and completion of an event
30-80X improvement for processors
6-8X improvement for memory and disks


Bandwidth and Latency

[Figure: log-log plot of bandwidth and latency milestones]

Transistors and Wires

Feature size
Minimum size of transistor or wire in x or y dimension
10 microns in 1971 to .032 microns in 2011
Transistor performance scales linearly
- Wire delay does not improve with feature size!
Integration density scales quadratically


Trends in Cost
Cost driven down by learning curve
Yield

DRAM: price closely tracks cost

Microprocessors: price depends on volume

10% less for each doubling of volume


Integrated Circuit Cost

Integrated circuit

Bose-Einstein formula:

$$\text{Die yield} = \text{Wafer yield} \times \frac{1}{(1 + \text{Defects per unit area} \times \text{Die area})^{N}}$$

Defects per unit area = 0.016-0.057 defects per square cm (2010)

N = process-complexity factor = 11.5-15.5 (40 nm, 2010)

Measuring Performance
Typical performance metrics:
Response time
Throughput

Speedup of X relative to Y
$\text{Execution time}_Y / \text{Execution time}_X$

Execution time
Wall clock time: includes all system overheads
CPU time: only computation time

Benchmarks
SPEC CPU 2006: cpu-oriented programs (for desktops)
SPECweb, TPC: throughput-oriented (for servers)
EEMBC: for embedded processors/workloads


Principles of Computer Design

Take Advantage of Parallelism
e.g. multiple processors, disks, memory banks,
pipelining, multiple functional units

Principle of Locality
Reuse of data and instructions

Focus on the Common Case

Amdahl's Law

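Compute Speedup: Amdahl's Law

Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{Time}_{\text{before}}}{\text{Time}_{\text{after}}}$$

Let F be the fraction of the execution time where the enhancement is applied (also called the parallel fraction), (1-F) the serial fraction, and S the speedup of the enhanced fraction:

$$\text{ExTime}_{\text{after}} = \text{ExTime}_{\text{before}} \times \left[(1-F) + \frac{F}{S}\right]$$

$$\text{Speedup}(E) = \frac{\text{ExTime}_{\text{before}}}{\text{ExTime}_{\text{after}}} = \frac{1}{(1-F) + \frac{F}{S}}$$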

Principles of Computer Design

The Processor Performance Equation


Principles of Computer Design

Different instruction types having different CPIs


Measuring Performance

How do we conclude that System-A is better than System-B?

Performance Metrics
Purchasing perspective
given a collection of machines, which has the
best performance?
least cost?
best cost/performance?

Design perspective
faced with design options, which has the
best performance improvement?
least cost?
best cost/performance?

Both require
basis for comparison
metric for evaluation

Our goal is to understand what factors in the architecture
contribute to overall system performance and the relative
importance (and cost) of these factors

Defining (Speed) Performance

Normally interested in reducing
Response time (execution time): the time between the start and
the completion of a task
- Important to individual users

Thus, to maximize performance, need to minimize execution time

$$\text{performance}_X = \frac{1}{\text{execution\_time}_X}$$

If X is n times faster than Y, then

$$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n$$

Throughput: the total amount of work done in a given time
- Important to data center managers

Decreasing response time almost always improves throughput


Performance Factors
Want to distinguish elapsed time and the time spent on
our task
CPU execution time (CPU time): time the CPU spends
working on a task
- Does not include time waiting for I/O or running other programs

$$\text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time}$$

or

$$\text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}}$$

Can improve performance by reducing either the length
of the clock cycle or the number of clock cycles required
for a program

Clock Cycles per Instruction

Not all instructions take the same amount of time to
execute
One way to think about execution time is that it equals the
number of instructions executed multiplied by the average time
per instruction

$$\text{\# CPU clock cycles for a program} = \text{\# Instructions for a program} \times \text{Average clock cycles per instruction}$$

Clock cycles per instruction (CPI): the average number
of clock cycles each instruction takes to execute
A way to compare two different implementations of the same
architecture

        CPI for this instruction class
        A    B    C
CPI     1    2    3

Effective CPI
Computing the overall effective CPI is done by looking at
the different types of instructions and their individual
cycle counts and averaging

$$\text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$$

Where IC_i is the count (percentage) of the number of instructions
of class i executed
CPI_i is the (average) number of clock cycles per instruction for
that instruction class
n is the number of instruction classes

The overall effective CPI varies by instruction mix: a
measure of the dynamic frequency of instructions across
one or many programs

THE Performance Equation

Our basic performance equation is then

$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$

or

$$\text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}}$$

These equations separate the three key factors that
affect performance
- Can measure the CPU execution time by running the program
- The clock rate is usually given
- Can measure overall instruction count by using profilers/
simulators without knowing all of the implementation details
- CPI varies by instruction type and ISA implementation, for which
we must know the implementation details

Determinants of CPU Performance

CPU time = Instruction_count x CPI x clock_cycle

[Table: rows are Algorithm, Programming language, Compiler, ISA, Processor organization, Technology; columns are Instruction_count, CPI, clock_cycle; an X marks each component that a row affects]

A Simple Example
Op       Freq   CPI_i   Freq x CPI_i
ALU      50%    1       0.5
Load     20%    5       1.0
Store    10%
Branch   20%
Sum                     2.2

How much faster would the machine be if a better data cache
reduced the average load time to 2 cycles?
CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster

How does this compare with using branch prediction to shave
a cycle off the branch time?
CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster

What if two ALU instructions could be executed at once?
(ALU contribution drops from 0.5 to 0.25)
CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster
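The three what-if questions above share one pattern: each change removes (frequency x cycles saved) from the baseline effective CPI of 2.2, and with instruction count (IC) and clock cycle (CC) unchanged, the speedup is the ratio of old CPI to new CPI. The following is a minimal Python sketch of that arithmetic. The helper name `new_cpi` is illustrative, and the cycles-saved values are assumptions inferred from the slide's results (loads going from 5 to 2 cycles, ALU CPI halving from 1 to 0.5).

```python
BASE_CPI = 2.2  # overall effective CPI from the slide


def new_cpi(base_cpi, freq, cycles_saved):
    """Effective CPI after one instruction class becomes cheaper.

    freq is the dynamic frequency of the class (0..1) and
    cycles_saved is the reduction in that class's CPI.
    """
    return base_cpi - freq * cycles_saved


# Cycles saved per class are assumptions inferred from the slide's answers.
scenarios = {
    "better data cache (loads 5 -> 2 cycles)": new_cpi(BASE_CPI, 0.20, 3.0),
    "branch prediction (one cycle off branches)": new_cpi(BASE_CPI, 0.20, 1.0),
    "two ALU instructions at once (CPI 1 -> 0.5)": new_cpi(BASE_CPI, 0.50, 0.5),
}

for name, cpi in scenarios.items():
    speedup = BASE_CPI / cpi  # IC and clock cycle cancel out
    print(f"{name}: CPI = {cpi:.2f}, {100 * (speedup - 1):.1f}% faster")
```

Because only CPI changes, the comparison never needs the actual instruction count or clock rate.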

                 m/c A   m/c B   m/c C
Prog 1 (secs)    1       10      20
Prog 2 (secs)    1000    100     20

Total time: 1001, 110, and 40 seconds

Comparing and Summarizing Performance

How do we summarize the performance for a benchmark
set with a single number?
The average of execution times that is directly proportional to
total execution time is the arithmetic mean (AM)

$$\text{AM} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$

Where Time_i is the execution time for the ith program of a total of
n programs in the workload
A smaller mean indicates a smaller average execution time and
thus improved performance

Guiding principle in reporting performance measurements
is reproducibility: list everything another experimenter
would need to duplicate the experiment (version of the
operating system, compiler settings, input set used,
specific computer configuration (clock rate, cache sizes
and speed, memory size and speed, etc.))

Normalized Execution Time

A second approach to an unequal mixture of programs in the workload: normalize execution times
to a reference m/c and then take the average of the normalized execution times.

Average normalized execution time: either an arithmetic or geometric mean

Geometric mean

$$\text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{execution time ratio}_i}$$

where execution time ratio_i is the execution time for program i normalized to the reference m/c
Interesting property

$$\frac{\text{Geometric mean}(X_i)}{\text{Geometric mean}(Y_i)} = \text{Geometric mean}\left(\frac{X_i}{Y_i}\right)$$

Normalized Execution Times

Geometric means of normalized execution times are consistent: it does not matter which m/c is
the reference

Not true for the arithmetic mean: it should not be used to average normalized execution times

Weighted arithmetic means: weightings are proportional to execution times, which are influenced by
frequency of use in the workload, type of m/c, and size of program input. The geometric mean is
independent of the running times of individual programs, and any m/c can be used for normalization

Example: Comparative performance evaluation where programs were fixed and inputs are not
Competitors could calculate weighted arithmetic mean by using their best performing
benchmark for the longest input.
Geometric mean would be less misleading than arithmetic mean for such situations

Normalized Execution Times

                  m/c A   m/c B   m/c C
Prog 1 (secs)     1       10      20
Prog 2 (secs)     1000    100     20

                  Normalized to A         Normalized to B         Normalized to C
                  A      B      C         A      B      C         A      B      C
Prog 1            1.0    10.0   20.0      0.1    1.0    2.0       0.05   0.5    1.0
Prog 2            1.0    0.1    0.02      10.0   1.0    0.2       50.0   5.0    1.0
Arithmetic mean   1.0    5.05   10.01     5.05   1.0    1.1       25.03  2.75   1.0
Geometric mean    1.0    1.0    0.63      1.0    1.0    0.63      1.58   1.58   1.0
Total time        1.0    0.11   0.04      9.1    1.0    0.36      25.03  2.75   1.0
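The normalized means in this table can be recomputed directly, which makes the inconsistency of the arithmetic mean easy to see. The following is a small Python sketch, assuming execution times of 1, 10, and 20 seconds for Prog 1 and 1000, 100, and 20 seconds for Prog 2 on machines A, B, and C (the values consistent with the normalized entries and total times above). The function names are illustrative.

```python
from math import prod

# Execution times in seconds for [Prog 1, Prog 2] on each machine.
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    return prod(xs) ** (1.0 / len(xs))


def normalized(machine, reference):
    """Execution times of `machine` divided by those of `reference`."""
    return [t / r for t, r in zip(times[machine], times[reference])]


for ref in times:
    for m in times:
        ratios = normalized(m, ref)
        print(f"{m} normalized to {ref}: "
              f"AM = {arithmetic_mean(ratios):.2f}, "
              f"GM = {geometric_mean(ratios):.2f}")
```

With A as the reference the arithmetic mean ranks B as about 5 times slower than A, and with B as the reference it ranks A as about 5 times slower than B. The geometric mean gives the same A-to-B relationship for every reference.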

Normalized Execution Times

From geometric means:

P1 and P2 perform the same on A and B. This is true only if P1 runs 100 times for every run of
P2.
For such a workload, the execution times on machines A and B are 1100 secs and the
execution time on m/c C is 2020 secs. Yet the geometric means suggest C is faster than A and
B.

Conclusion: In general there is no workload for three or more machines that will
match the performance predicted by the geometric means of normalized execution
times.

Another problem with GM: It encourages h/w and s/w designers to focus their
attention on the benchmarks where performance is easiest to improve rather than on
the benchmarks that are slowest.
Example: cutting running time from 2 seconds to 1 has the same effect on the GM as cutting it from 10,000 seconds to 5,000

Example
A new laptop has an IPC that is 20% worse than the old
laptop. It has a clock speed that is 30% higher than the old
laptop. The same binaries run on both machines.
What is the speedup of the new laptop?
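One way to approach this question: with the same binaries, the instruction count is unchanged, so execution time is IC / (IPC x clock rate) and the speedup is the product of the IPC ratio and the clock-rate ratio. This is a minimal Python sketch of that reasoning; the function name `speedup_from_ratios` is illustrative, and "20% worse" is read as the new IPC being 0.8 times the old one.

```python
def speedup_from_ratios(ipc_ratio, clock_ratio):
    """Speedup (old time / new time) for a fixed instruction count.

    time = IC / (IPC * clock_rate), so
    old_time / new_time = (new_IPC / old_IPC) * (new_clock / old_clock).
    """
    return ipc_ratio * clock_ratio


# IPC is 20% worse (0.8x), clock speed is 30% higher (1.3x).
laptop_speedup = speedup_from_ratios(0.8, 1.3)
print(f"speedup of the new laptop = {laptop_speedup:.2f}")
```

Under these assumptions the higher clock rate slightly outweighs the lower IPC.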


Example

A new laptop has an IPC that is 20% worse than the old
laptop. It has a clock speed that is 30% higher than the old
laptop. The same binaries run on both machines. Their IPCs are
listed below. The binaries are run such that each binary
gets an equal share of CPU time.
What is the speedup of the new laptop?

          P1    P2    P3
Old-IPC   1.2   1.6   2.0
New-IPC   1.6   1.6   1.6

Example

          P1    P2    P3    AM    GM
Old-IPC   1.2   1.6   2.0   1.6   1.57
New-IPC   1.6   1.6   1.6   1.6   1.6

Speedup Vs. Percentage

Speedup is a ratio = old exec time / new exec time
Improvement, Increase, Decrease usually refer to
percentage relative to the baseline
= (new perf - old perf) / old perf
A program ran in 100 seconds on the old laptop and in 70
seconds on the new laptop
What is the speedup?
What is the percentage increase in performance?
What is the reduction in execution time?
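The three questions give three different numbers for the same measurement, which is why the terms should not be mixed. A short Python sketch of the arithmetic, taking performance as 1 / execution time as defined earlier in the deck (variable names are illustrative):

```python
old_time = 100.0  # seconds on the old laptop
new_time = 70.0   # seconds on the new laptop

# Speedup is a ratio of execution times.
speedup = old_time / new_time

# Percentage increase in performance, with performance = 1 / time.
old_perf = 1.0 / old_time
new_perf = 1.0 / new_time
perf_increase = (new_perf - old_perf) / old_perf

# Reduction in execution time relative to the baseline.
time_reduction = (old_time - new_time) / old_time

print(f"speedup = {speedup:.2f}")
print(f"performance increase = {100 * perf_increase:.1f}%")
print(f"execution time reduction = {100 * time_reduction:.0f}%")
```

A 30% reduction in execution time corresponds to a performance increase of about 43%, not 30%.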


Quantitative Principles of Computer Design

Designing for performance and cost-performance

One of the most important principles: make the common case fast. Favor the frequent case over the
infrequent case (a design trade-off)

How much performance can be improved by making the frequent case faster?

Amdahl's Law: The performance improvement to be gained from using some faster mode of execution
is limited by the fraction of the time the faster mode can be used. It defines the speedup that can be
gained by using a particular feature.

$$\text{speedup} = \frac{\text{performance for the entire task using the enhancement}}{\text{performance for the entire task without using the enhancement}} = \frac{\text{execution time for the entire task without using the enhancement}}{\text{execution time for the entire task using the enhancement}}$$

Quantitative Principles of Computer Design

Depends on two factors:

The fraction of the computational time in the original machine that can be converted to take
advantage of the enhancement
Example: if 20 seconds of the execution time of a program that takes 80 seconds in total
can use an enhancement, the fraction is 20/80

$$\text{fraction}_{\text{enhanced}} \le 1$$

The improvement gained by the enhanced execution mode: how much faster the task
would run if the enhanced mode were used for the entire program

$$\text{execution time}_{\text{new}} = \text{execution time}_{\text{old}} \times \left[(1 - \text{fraction}_{\text{enhanced}}) + \frac{\text{fraction}_{\text{enhanced}}}{\text{speedup}_{\text{enhanced}}}\right]$$

The overall speedup is the ratio of the execution times:

$$\text{speedup}_{\text{overall}} = \frac{\text{execution time}_{\text{old}}}{\text{execution time}_{\text{new}}} = \frac{1}{(1 - \text{fraction}_{\text{enhanced}}) + \frac{\text{fraction}_{\text{enhanced}}}{\text{speedup}_{\text{enhanced}}}}$$

Quantitative Principles of Computer Design

Example
Suppose we are considering an enhancement to the processor of a server
system used for Web serving. The new CPU is 10 times faster on
computation in the Web serving application than the original processor.
Assuming that the original CPU is busy with computation 40% of the time
and is waiting for I/O 60% of the time, what is the overall speedup gained
by incorporating the enhancement?
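$$\text{speedup}_{\text{overall}} = \frac{1}{(1 - \text{fraction}_{\text{enhanced}}) + \frac{\text{fraction}_{\text{enhanced}}}{\text{speedup}_{\text{enhanced}}}}$$

$$\text{fraction}_{\text{enhanced}} = 0.4$$

$$\text{speedup}_{\text{enhanced}} = 10$$

$$\text{speedup}_{\text{overall}} = \frac{1}{0.6 + \frac{0.4}{10}} = 1.56$$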

Quantitative Principles of Computer Design

Amdahl's Law expresses the law of diminishing returns: the
incremental improvement in speedup gained by an additional improvement
in the performance of just a portion of the computation diminishes as
improvements are added.

If an enhancement is only usable for a fraction of a task, we cannot speed
up the task by more than 1/(1 - fraction)

Caution! Do not confuse "fraction of time converted to use the enhancement"
with "fraction of time after the enhancement is in use"
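The diminishing returns can be seen numerically by holding the enhanced fraction fixed and making the enhancement faster and faster. The following is a minimal Python sketch, assuming the 40% computation fraction from the Web-serving example; the function name `amdahl_speedup` and the chosen sweep values are illustrative.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of the original execution
    time runs speedup_enhanced times faster (Amdahl's Law)."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


FRACTION = 0.4  # share of time spent computing in the Web-serving example
upper_bound = 1.0 / (1.0 - FRACTION)

for s in (2, 10, 100, 1000):
    print(f"speedup_enhanced = {s:>4}: "
          f"overall = {amdahl_speedup(FRACTION, s):.3f}")
print(f"upper bound 1/(1 - fraction) = {upper_bound:.3f}")
```

Going from a 10x to a 1000x faster CPU barely moves the overall speedup, because the 60% of time spent waiting for I/O is untouched.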

Quantitative Principles of Computer Design

Example: A common transformation required in graphics engines is square
root. Implementations of floating-point (FP) square root vary significantly in
performance, especially among processors designed for graphics. Suppose
FP square root (FPSQR) is responsible for 20% of the execution time of a
critical graphics benchmark. One proposal is to enhance the FPSQR
hardware and speed up this operation by a factor of 10. The other alternative
is just to try to make all FP instructions in the graphics processor run faster by a factor of
1.6; FP instructions are responsible for a total of 50% of the execution time
for the application. The design team believes that they can make all FP
instructions run 1.6 times faster with the same effort required for the fast
square root. Compare the two design alternatives.

$$\text{speedup}_{\text{FPSQR}} = \frac{1}{(1 - 0.2) + \frac{0.2}{10}} = 1.22$$

$$\text{speedup}_{\text{FP}} = \frac{1}{(1 - 0.5) + \frac{0.5}{1.6}} = 1.23$$

Improving the performance of the FP operations overall is slightly better because
of the higher frequency

The CPU Performance Equation

Sometimes difficult to measure time consumed by different operations

Alternative: Decompose CPU execution time into different components and see
how different alternatives affect these components

$$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{clock cycle time}$$

$$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{clock rate}}$$

In terms of number of instructions executed (instruction path length or instruction
count, IC) and clock cycles per instruction (CPI):

$$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{Instruction count}}$$

This figure of merit provides insight into different styles of instruction sets and implementations

CPU Performance Equation

$$\text{CPU clock cycles for a program} = \text{IC} \times \text{CPI}$$

$$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{clock cycle time}$$

$$\text{CPU time} = \frac{\text{IC} \times \text{CPI}}{\text{clock rate}}$$

$$\frac{\text{instructions}}{\text{program}} \times \frac{\text{clock cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{clock cycle}} = \frac{\text{seconds}}{\text{program}} = \text{CPU time}$$

CPU performance is dependent on three characteristics: clock cycle (or rate), clock cycles per
instruction, and instruction count

Difficult to change one parameter without affecting other parameters.

Clock cycle time: hardware technology and organization
CPI: organization and instruction set architecture
Instruction count: ISA and compiler technology

Many performance improvement techniques improve one component of CPU performance
with small or predictable impact on the other two

CPU Performance Equation

Sometimes it is useful to calculate the number of total CPU clock cycles as

$$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i$$

$$\text{CPU time} = \left(\sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i\right) \times \text{clock cycle time}$$

$$\text{CPI} = \frac{\sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i}{\text{IC}} = \sum_{i=1}^{n} \frac{\text{IC}_i}{\text{IC}} \times \text{CPI}_i$$

Note: CPI_i should be measured and not just calculated from the reference table

Possible uses of CPU performance equation:

High-level performance comparisons
Back-of-the-envelope calculations
Enable architects to think about compilers and technology

CPU Performance Equation

Example
Suppose we have the following measurements:

Frequency of FP operations = 25%

Average CPI of FP operations = 4.0
Average CPI of other operations = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20

Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the
average CPI of all the FP operations to 2.5. Compare the two design alternatives using the
CPU performance equation.

Note: only the CPI changes; the clock rate and IC do not change


CPU Performance Equation

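CPI without enhancement

$$\text{CPI}_{\text{original}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{IC}} = (4 \times 25\%) + (1.33 \times 75\%) = 2.0$$

CPI for enhanced FPSQR

$$\text{CPI}_{\text{new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times (\text{CPI}_{\text{old FPSQR}} - \text{CPI}_{\text{new FPSQR only}}) = 2.0 - 2\% \times (20 - 2) = 1.64$$

CPI for enhanced FP instructions

$$\text{CPI}_{\text{new FP}} = (75\% \times 1.33) + (25\% \times 2.5) = 1.62$$

$$\text{Speedup}_{\text{new FP}} = \frac{\text{CPU time}_{\text{original}}}{\text{CPU time}_{\text{new FP}}} = \frac{\text{IC} \times \text{clock cycle time} \times \text{CPI}_{\text{original}}}{\text{IC} \times \text{clock cycle time} \times \text{CPI}_{\text{new FP}}} = \frac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} = \frac{2.0}{1.62} = 1.23$$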
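The comparison of the two alternatives can be checked with a few lines of arithmetic. This Python sketch uses the measurements given in the example; the variable names are illustrative, and since IC and clock rate do not change, each speedup is simply a ratio of CPIs.

```python
# Measurements from the example.
FP_FREQ, FP_CPI = 0.25, 4.0
OTHER_FREQ, OTHER_CPI = 0.75, 1.33
FPSQR_FREQ, FPSQR_CPI = 0.02, 20.0

cpi_original = FP_FREQ * FP_CPI + OTHER_FREQ * OTHER_CPI

# Alternative 1: reduce the CPI of FPSQR from 20 to 2.
cpi_new_fpsqr = cpi_original - FPSQR_FREQ * (FPSQR_CPI - 2.0)

# Alternative 2: reduce the average CPI of all FP operations to 2.5.
cpi_new_fp = OTHER_FREQ * OTHER_CPI + FP_FREQ * 2.5

# IC and clock cycle time cancel, so speedup = CPI_original / CPI_new.
speedup_fpsqr = cpi_original / cpi_new_fpsqr
speedup_fp = cpi_original / cpi_new_fp

print(f"CPI original  = {cpi_original:.2f}")
print(f"CPI new FPSQR = {cpi_new_fpsqr:.2f}, speedup = {speedup_fpsqr:.2f}")
print(f"CPI new FP    = {cpi_new_fp:.2f}, speedup = {speedup_fp:.2f}")
```

The unrounded values differ slightly from the slide's rounded intermediate figures, but the conclusion is the same: improving all FP operations gives the marginally lower CPI.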

CPU Performance Equation

Example
Machine A: 40% ALU operations (1 cycle), 24% loads (1 cycle), 11% stores (2 cycles), 25%
branches (2 cycles)

Should one-cycle stores be implemented if it slows the clock by 15%?

old CPI = 0.4 + 0.24 + (0.11 x 2) + (0.25 x 2) = 1.36 cycles/instruction

new CPI = 0.4 + 0.24 + 0.11 + (0.25 x 2) = 1.25 cycles/instruction

Speedup = (IC x old CPI x old clock cycle time)/(IC x new CPI x new clock cycle time)

= (1.36 x T)/(1.25 x 1.15T) = 0.946

NO!
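The same decision can be reproduced from the instruction mix, which makes it easy to try other clock penalties. A minimal Python sketch, assuming the mix and cycle counts stated for Machine A; the helper name `effective_cpi` is illustrative.

```python
def effective_cpi(mix):
    """Weighted-average CPI for a mix of {class: (frequency, cycles)}."""
    return sum(freq * cycles for freq, cycles in mix.values())


machine_a = {
    "alu": (0.40, 1),
    "load": (0.24, 1),
    "store": (0.11, 2),
    "branch": (0.25, 2),
}
# Proposed change: stores take one cycle instead of two.
one_cycle_stores = dict(machine_a, store=(0.11, 1))

old_cpi = effective_cpi(machine_a)
new_cpi = effective_cpi(one_cycle_stores)

CLOCK_SLOWDOWN = 1.15  # new clock cycle time is 15% longer
speedup = old_cpi / (new_cpi * CLOCK_SLOWDOWN)  # IC cancels out

print(f"old CPI = {old_cpi:.2f}, new CPI = {new_cpi:.2f}")
print(f"speedup = {speedup:.3f}")
```

A speedup below 1 means the modified machine is slower overall, so the lower CPI does not pay for the slower clock.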

CPU Performance Equation

How to actually measure execution time and CPI?

Execution time: time (Unix/Linux command)

Calculate CPI

Break down CPI into compute, memory stalls, etc. to identify performance problems

CPI Breakdown Measurement

Hardware event counters

Cycle-level simulators


Harmonic Mean

If performance is expressed as a rate, then the average that tracks the total execution
time is the harmonic mean:

$$\text{HM} = \frac{n}{\sum_{i=1}^{n} \frac{1}{\text{Rate}_i}}$$

Note: Arithmetic mean cannot be used for rates

Example: Travel at 40 MPH for one mile and 80 MPH for one mile: the average is not 60 MPH

HM = 2/(1/40 + 1/80) = 53.33 MPH
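The travel example can be verified by computing the average speed from total distance and total time, which is what the harmonic mean tracks. A short Python sketch (function and variable names are illustrative):

```python
def harmonic_mean(rates):
    """Harmonic mean of a list of positive rates."""
    return len(rates) / sum(1.0 / r for r in rates)


speeds_mph = [40.0, 80.0]  # one mile driven at each speed

hm = harmonic_mean(speeds_mph)
am = sum(speeds_mph) / len(speeds_mph)

# Direct check: total distance divided by total time.
total_miles = float(len(speeds_mph))
total_hours = sum(1.0 / s for s in speeds_mph)
true_average = total_miles / total_hours

print(f"harmonic mean   = {hm:.2f} MPH")
print(f"arithmetic mean = {am:.2f} MPH")
print(f"distance / time = {true_average:.2f} MPH")
```

The harmonic mean matches distance over time, while the arithmetic mean overstates the average speed because more time is spent at the slower rate.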

Little's Law

Key relationship between latency and bandwidth

The average number of objects in a queue is the product of the entry rate and the average
holding time

average number of objects = arrival rate x average holding time
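As a sketch of how this relationship ties latency to bandwidth, the Python snippet below applies it to a memory system. The numbers (100 ns latency, one request issued every 2 ns) are hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
def average_in_flight(arrival_rate, average_holding_time):
    """Little's Law: average number of objects in the system."""
    return arrival_rate * average_holding_time


# Hypothetical memory system, for illustration only.
requests_per_ns = 0.5  # bandwidth: one request every 2 ns
latency_ns = 100.0     # each request is outstanding for 100 ns

outstanding = average_in_flight(requests_per_ns, latency_ns)
print(f"requests in flight on average = {outstanding:.0f}")
```

Read the other way, sustaining that bandwidth at that latency requires hardware that can track this many outstanding requests at once.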


Power and Energy

Problem: Get power in, get power out
Thermal Design Power (TDP)
Characterizes sustained power consumption
Used as target for power supply and cooling system
Lower than peak power, higher than average power
consumption

Clock rate can be reduced dynamically to limit
power consumption
Energy per task is often a better measurement

Dynamic Energy and Power

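Dynamic energy
Transistor switch from 0 -> 1 or 1 -> 0

$$\text{Energy}_{\text{dynamic}} = \frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2$$

Dynamic power

$$\text{Power}_{\text{dynamic}} = \frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$$

Reducing clock rate reduces power, not energy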

Power
Intel 80386 consumed ~ 2 W
3.3 GHz Intel Core i7 consumes 130 W
Heat must be dissipated from 1.5 x 1.5 cm chip
This is the limit of what can be cooled by air

Reducing Power
Techniques for reducing power:
Do nothing well
Dynamic Voltage-Frequency Scaling
Low power state for DRAM, disks
Overclocking, turning off cores


Static Power
Static power consumption
$\text{Current}_{\text{static}} \times \text{Voltage}$
Scales with number of transistors
To reduce: power gating


Power Trends

The Power Wall

In CMOS IC technology

$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$

[Figure annotations: Power x30, Voltage 5V -> 1V, Frequency x1000]

Reducing Power
Suppose a new CPU has
85% of capacitive load of old CPU
15% voltage and 15% frequency reduction

$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^2 \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^2 \times F_{\text{old}}} = 0.85^4 = 0.52$$

The power wall

We can't reduce voltage further
We can't remove more heat
How else can we improve performance?
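Because power is proportional to capacitive load x voltage squared x frequency, the old values cancel and only the scaling factors matter. A minimal Python sketch of the ratio above; the function name `dynamic_power_ratio` is illustrative.

```python
def dynamic_power_ratio(cap_scale, volt_scale, freq_scale):
    """P_new / P_old when P is proportional to C * V^2 * F."""
    return cap_scale * volt_scale ** 2 * freq_scale


# 85% of the capacitive load, 15% lower voltage, 15% lower frequency.
ratio = dynamic_power_ratio(0.85, 0.85, 0.85)
print(f"P_new / P_old = {ratio:.2f}")

# Voltage counts twice: the same 15% cut applied to frequency alone
# saves far less power.
freq_only = dynamic_power_ratio(1.0, 1.0, 0.85)
print(f"frequency-only reduction: P_new / P_old = {freq_only:.2f}")
```

The quadratic voltage term is why voltage scaling was the main lever for power, and why running out of voltage headroom leads to the power wall.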

Dependability
Module reliability
Mean time to failure (MTTF)
Mean time to repair (MTTR)
Mean time between failures (MTBF) = MTTF + MTTR
Availability = MTTF / MTBF


Define and quantify dependability

Module reliability = measure of continuous service
accomplishment (or time to failure).
Two metrics

1. Mean Time To Failure (MTTF) measures Reliability

2. Failures In Time (FIT) = 1/MTTF, the rate of failures
Traditionally reported as failures per billion hours of operation

Mean Time To Repair (MTTR) measures Service Interruption
Mean Time Between Failures (MTBF) = MTTF+MTTR

Module availability measures service as alternating
between the 2 states of accomplishment and
interruption (number between 0 and 1, e.g. 0.9)

Module availability = MTTF / (MTTF + MTTR)

Example calculating reliability

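If modules have exponentially distributed lifetimes
(age of module does not affect probability of
failure), overall failure rate is the sum of failure
rates of the modules

Calculate FIT and MTTF for 10 disks (1M hour
MTTF per disk), 1 disk controller (0.5M hour
MTTF), and 1 power supply (0.2M hour MTTF):

$$\text{FailureRate} = 10 \times \frac{1}{1{,}000{,}000} + \frac{1}{500{,}000} + \frac{1}{200{,}000} = \frac{10 + 2 + 5}{1{,}000{,}000} = \frac{17}{1{,}000{,}000} = 17{,}000 \text{ FIT}$$

$$\text{MTTF} = \frac{1{,}000{,}000{,}000}{17{,}000} \approx 59{,}000 \text{ hours}$$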
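The calculation for this example can be sketched in Python by summing the per-module failure rates, which is valid under the exponential-lifetime assumption stated above. The variable names and the (count, MTTF) list layout are illustrative.

```python
# (number of modules, MTTF in hours per module)
modules = [
    (10, 1_000_000),  # disks
    (1, 500_000),     # disk controller
    (1, 200_000),     # power supply
]

# With exponentially distributed lifetimes, failure rates simply add.
failures_per_hour = sum(count / mttf for count, mttf in modules)

# FIT is reported as failures per billion hours of operation.
fit = failures_per_hour * 1e9
system_mttf_hours = 1.0 / failures_per_hour

print(f"system failure rate = {fit:,.0f} FIT")
print(f"system MTTF = {system_mttf_hours:,.0f} hours")
```

The system MTTF is far below that of any single module: more components mean more ways to fail, and the least reliable part (the power supply) contributes the largest single share.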

Performance Improvement

Exploiting properties of programs

Principle of Locality: programs tend to reuse data and instructions they have used
recently
Rule of thumb: A program spends 90% of its execution time in only 10% of the code. We
can predict with reasonable accuracy what instructions and data a program will use in
the near future based on its accesses in the recent past.
Principle of locality also applies to data accesses, though not as strongly as to code accesses.
Two types of locality

Temporal locality: recently accessed items are likely to be accessed again in
the near future
Spatial locality: items whose addresses are near one another tend to be
referenced close together in time

Performance Improvement

Take Advantage of Parallelism

One of the most important methods for improving performance

Parallelism at different levels. For example,

System level: multiprocessors and multiple disks for servers. Workload spread among
CPUs or disks improves throughput; scalability is important for server applications
Individual processors: parallelism at the instruction level, such as pipelining
Parallelism at the level of detailed digital design, such as carry-lookahead ALUs and set-associative caches for parallel searches

One common class of techniques is parallel computation of two or more possible
outcomes, followed by late selection, such as carry-select adders and handling
branches in pipelines

Multiprocessors
Multicore microprocessors
More than one processor per chip

Requires explicitly parallel programming

Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer
Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization


Fallacies and Pitfalls

Fallacy: The relative performance of two processors with the same ISA can be judged by clock
rate or by the performance of a single benchmark suite.
As processors have become faster and more sophisticated, processor performance in one
application area can diverge from that in another area. Sometimes the ISA is responsible,
but more often the pipeline structure and memory system are responsible. Thus the clock
rate is not a good metric.
[Figure: performance of a 1.7 GHz Pentium 4 relative to a 1 GHz Pentium III]

Fallacies and Pitfalls

Pitfall: Comparing hand-coded assembly and compiler-generated, high-level language
performance
Embedded market: several factors encourage hand-coding, at least of key loops: the
importance of a few small loops to overall performance (real-time performance) in some
embedded applications, and the inclusion of instructions that can significantly boost performance of
certain computations but that compilers cannot effectively use
Hand-coding of the critical parts can lead to large performance gains

Machine                     EEMBC benchmark set   Compiler-generated performance   Hand-coded performance   Ratio hand/compiler
Trimedia 1300 @ 166 MHz     Consumer              23.3                             110                      4.7
BOPS Manta @ 136 MHz        Telecomm              2.6                              225.8                    86.8
TI TMS320C6203 @ 300 MHz    Telecomm              6.8                              68.5                     10.1

Fallacies and Pitfalls

Fallacy: Peak performance tracks observed performance

One definition of peak performance is the performance level a machine is guaranteed
not to exceed. The gap between peak performance and observed performance is typically a
factor of 10 or more for supercomputers. Since the gap is large and can vary significantly,
peak performance is not useful in predicting observed performance unless the workload
consists of small programs that operate close to peak.
Peak MIPS: Obtained by choosing an instruction mix that minimizes the CPI, even if the
mix is totally impractical.

Fallacies and Pitfalls

Fallacy: The best design for a computer is the one that optimizes the primary objective without
considering implementation
Complex designs take longer to complete and prolong time to market, so the design will be less
competitive.

Pitfall: Neglecting the cost of software in either evaluating a system or examining cost-performance
For many years, hardware was so expensive that it dominated the cost of software, but this
is no longer true.
For a medium-size database server, the software costs are roughly 50% of the total cost, while
for a desktop the software cost is somewhere between 23% and 38%.


Pitfall: Falling prey to Amdahl's Law

Expending tremendous effort optimizing some aspect of a system before measuring its
usage
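Amdahl's Law makes the cost of this pitfall concrete: the overall speedup is limited by the fraction of time the optimized part is actually used. The sketch below uses the standard formula, speedup = 1 / ((1 - f) + f / s); the function name and the two scenarios are hypothetical and chosen only for illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the execution time is improved.

    fraction_enhanced: share of the original execution time that benefits (0..1).
    speedup_enhanced:  how much faster that share becomes.
    """
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical case A: heavy effort yields 10x on a feature used 5% of the time.
rarely_used = amdahl_speedup(0.05, 10.0)

# Hypothetical case B: a modest 2x on a feature used 80% of the time.
heavily_used = amdahl_speedup(0.80, 2.0)

print(f"10x on 5% of the time: {rarely_used:.3f}x overall")
print(f"2x on 80% of the time: {heavily_used:.3f}x overall")
```

Under these assumed numbers, the large 10x improvement yields under 5% overall gain, while the modest 2x improvement yields roughly 1.67x, which is why usage should be measured before optimizing.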

Fallacy: MIPS is an accurate measure for comparing performance among processors

The embedded market uses Dhrystone as the benchmark of choice and reports performance as
Dhrystone MIPS.

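\[
\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} = \frac{\text{Clock rate}}{\text{CPI} \times 10^{6}}
\]

\[
\text{Execution time} = \frac{\text{IC}}{\text{MIPS} \times 10^{6}}
\]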

MIPS is easy to understand.
Faster machines have a higher MIPS rating, which matches intuition.
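The MIPS definitions above can be written as three small functions. This is a minimal sketch; the function names and the example machine (500 MHz clock, CPI of 2, one billion instructions) are assumptions made for illustration, not figures from the slides.

```python
def mips_from_time(instruction_count, execution_time_s):
    """MIPS = instruction count / (execution time * 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


def mips_from_cpi(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def execution_time_s(instruction_count, mips):
    """Execution time = instruction count / (MIPS * 10^6)."""
    return instruction_count / (mips * 1e6)


# Hypothetical machine and program, for illustration only.
clock_rate_hz = 500e6
cpi = 2.0
instruction_count = 1e9

rating = mips_from_cpi(clock_rate_hz, cpi)
seconds = execution_time_s(instruction_count, rating)

print(f"MIPS rating: {rating:.0f}")
print(f"Execution time: {seconds:.1f} s")
print(f"MIPS recomputed from time: {mips_from_time(instruction_count, seconds):.0f}")
```

Both forms of the definition give the same rating because execution time equals instruction count times CPI divided by clock rate.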


Problems with using MIPS as a measure for comparison:

MIPS is dependent on the instruction set, making it difficult to
compare MIPS of computers with different instruction sets.

MIPS varies between programs on the same computer.

MIPS can vary inversely to performance. Example: the MIPS rating of a
machine with optional floating-point hardware. A floating-point instruction
takes more clock cycles than an integer instruction, so floating-point
programs that use the optional hardware instead of software floating-point
routines take less time but have a lower MIPS rating. Software
floating point executes simpler instructions, resulting in a higher MIPS
rating, but it executes so many instructions that the execution time is
longer.

MIPS is sometimes used by a single vendor within a single set of
machines designed for a given class of applications.
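The floating-point example can be made concrete with a small calculation. All numbers below (a 100 MHz clock, instruction counts, and CPI values) are hypothetical and were picked only to show the effect; they do not describe any real machine.

```python
CLOCK_RATE_HZ = 100e6  # assumed clock rate, shared by both configurations


def time_and_mips(instruction_count, cpi, clock_rate_hz=CLOCK_RATE_HZ):
    """Return (execution time in seconds, MIPS rating) for one program run."""
    time_s = instruction_count * cpi / clock_rate_hz
    mips = instruction_count / (time_s * 1e6)
    return time_s, mips


# Hardware floating point: few instructions, but each takes many cycles.
hw_time, hw_mips = time_and_mips(instruction_count=20e6, cpi=4.0)

# Software floating point: many simple instructions with a low CPI.
sw_time, sw_mips = time_and_mips(instruction_count=200e6, cpi=1.25)

print(f"hardware FP: {hw_time:.2f} s at {hw_mips:.0f} MIPS")
print(f"software FP: {sw_time:.2f} s at {sw_mips:.0f} MIPS")
```

With these assumed values the hardware version finishes in 0.8 s at 25 MIPS, while the software version needs 2.5 s at 80 MIPS: the higher MIPS rating belongs to the slower run.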


Designers started reporting relative MIPS; the embedded market reports it for
Dhrystone.

Relative MIPS for a machine M was defined based on some
reference machine as

\[
\text{MIPS}_{M} = \frac{\text{Performance}_{M}}{\text{Performance}_{\text{reference}}} \times \text{MIPS}_{\text{reference}}
\]
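As a rough sketch of the relative MIPS definition, the function below takes performance as the inverse of execution time for the same program on both machines. The function name, the execution times, and the 1 MIPS reference rating are assumptions made for illustration.

```python
def relative_mips(time_m_s, time_reference_s, mips_reference):
    """Relative MIPS of machine M, with performance defined as 1 / execution time."""
    performance_m = 1.0 / time_m_s
    performance_reference = 1.0 / time_reference_s
    return (performance_m / performance_reference) * mips_reference


# Hypothetical measurement: the reference machine, rated at 1 MIPS,
# runs the benchmark in 10 s; machine M runs the same benchmark in 2 s.
rating_m = relative_mips(time_m_s=2.0, time_reference_s=10.0, mips_reference=1.0)
print(f"Relative MIPS of M: {rating_m:.1f}")
```

Because the rating is just a scaled ratio of execution times, it tracks performance only for the benchmark and input that were actually measured.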

MFLOPS (millions of floating-point operations per second) similarly measures the inverse of execution time.


Other Performance Metrics

Power consumption matters especially in the embedded market,
where battery life (and passive cooling) is important.
For power-limited applications, the most important metric is
energy efficiency.


Summary: Evaluating ISAs

Design-time metrics:
Can it be implemented, in how long, at what cost?
Can it be programmed? Ease of compilation?

Static Metrics:
How many bytes does the program occupy in memory?

Dynamic Metrics:
How many instructions are executed? How many bytes does the
processor fetch to execute the program?
How many clocks are required per instruction (CPI)?
How "lean" a clock is practical?

Best Metric: Time to execute the program!

This depends on the instruction set, the
processor organization, and compilation
techniques.

[Figure: Inst. Count, CPI, Cycle Time]
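The three dynamic metrics combine into execution time as instruction count times CPI times cycle time. The sketch below compares two designs under that relation; the function name and all numbers for "design A" and "design B" are hypothetical.

```python
def cpu_time_s(instruction_count, cpi, cycle_time_s):
    """Execution time = instruction count * CPI * clock cycle time."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical design A: more instructions, but a low CPI.
time_a = cpu_time_s(instruction_count=1.0e9, cpi=1.5, cycle_time_s=1e-9)

# Hypothetical design B: 20% fewer instructions, but a higher CPI
# at the same cycle time.
time_b = cpu_time_s(instruction_count=0.8e9, cpi=2.5, cycle_time_s=1e-9)

print(f"design A: {time_a:.2f} s")
print(f"design B: {time_b:.2f} s")
```

Under these assumptions design B executes fewer instructions yet takes longer (2.0 s versus 1.5 s), which is why no single factor substitutes for time to execute the program.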
