Advanced Computer Architecture Fundamentals of Computer Design

This document provides an outline for a lecture on advanced computer architecture. It discusses computer science being at a crossroads, the differences between computer architecture and instruction set architecture, trends in technologies like processors, memory, disks, and networks over the past 20 years. It shows how bandwidth has increased much more than latency. The document also covers trends in power for integrated circuits, dependability, measuring and reporting performance, and quantitative principles of computer design.
Advanced Computer Architecture

Fundamentals of Computer Design


Myung Hoon Sunwoo
School of Electrical and Computer Engineering
Ajou University

Outline

Computer Science at a Crossroads
Computer Architecture v. Instruction Set Arch.
Trends in Technology
Trends in Power in Integrated Circuits
Dependability
Measuring, Reporting, and Summarizing Performance
Quantitative Principles of Computer Design

Ajou Univ.

Multimedia
Communications

SOC Lab.

Crossroads: Uniprocessor Performance

[Figure: uniprocessor performance relative to the VAX-11/780 (log scale, 1 to 10000), 1978 to 2006. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006.]

VAX: 25%/year 1978 to 1986
RISC + x86: 52%/year 1986 to 2002
RISC + x86: ??%/year 2002 to present
Instruction Set Architecture: Critical Interface

software
instruction set
hardware

Properties of a good abstraction
Lasts through many generations (portability)
Used in many different ways (generality)
Provides convenient functionality to higher levels
Permits an efficient implementation at lower levels

Example: MIPS

Programmable storage
2^32 x bytes
31 x 32-bit GPRs (R0=0)
32 x 32-bit FP regs (paired DP)
HI, LO, PC

[Figure: register file r0, r1, ..., r31, plus PC, lo, hi]

Data types? Format? Addressing Modes? Operations?

Arithmetic logical
Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU,
AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
SLL, SRL, SRA, SLLV, SRLV, SRAV

Memory Access
LB, LBU, LH, LHU, LW, LWL, LWR
SB, SH, SW, SWL, SWR

Control
J, JAL, JR, JALR
BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL

32-bit instructions on word boundary

Instruction Set Architecture

... the attributes of a [computing] system as seen by the programmer,
i.e. the conceptual structure and functional behavior, as distinct from
the organization of the data flows and controls, the logic design, and
the physical implementation.

Amdahl, Blaauw, and Brooks, 1964

SOFTWARE
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions

ISA vs. Computer Architecture

Old definition of computer architecture = instruction set design
Other aspects of computer design called implementation
Insinuates implementation is uninteresting or less challenging

Our view is computer architecture >> ISA
Architect's job much more than instruction set design; technical hurdles today more challenging than those in instruction set design
Since instruction set design not where action is, some conclude computer architecture (using old definition) is not where action is
We disagree on conclusion
Agree that ISA not where action is (ISA in CA:AQA 4/e appendix)

Comp. Arch. is an Integrated Approach

What really matters is the functioning of the complete system
hardware, runtime system, compiler, operating system, and application
In networking, this is called the "End to End argument"
Computer architecture is not just about transistors, individual instructions, or particular implementations
E.g., Original RISC projects replaced complex instructions with a compiler + simple instructions


Moore's Law: 2X transistors / "year"

"Cramming More Components onto Integrated Circuits"
Gordon Moore, Electronics, 1965

# of transistors / cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
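
To get a feel for how much the doubling period N matters, the growth factor over a span of years can be computed directly. This is only an illustrative sketch: the function name `growth_factor` and the 10-year horizon are assumptions chosen for the example, not part of the original slide.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Factor by which the transistor count grows over `years`
    if it doubles every `doubling_months` months."""
    return 2.0 ** (12.0 * years / doubling_months)


if __name__ == "__main__":
    # Compare the two ends of the 12..24 month range and a midpoint.
    for n in (12, 18, 24):
        print(f"N = {n} months: x{growth_factor(10, n):,.0f} in 10 years")
```

Over the same decade, N = 12 and N = 24 differ by a factor of 32 in the resulting transistor count, which is why the exact doubling period is worth stating.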


Tracking Technology Performance Trends

Drill down into 4 technologies:
Disks,
Memory,
Network,
Processors

Compare ~1980 Archaic (Nostalgic) vs. ~2000 Modern (Newfangled)
Performance Milestones in each technology

Compare for Bandwidth vs. Latency improvements in performance over time
Bandwidth: number of events per unit time
E.g., M bits / second over network, M bytes / second from disk
Latency: elapsed time for a single event
E.g., one-way network delay in microseconds, average disk access time in milliseconds


Disks: Archaic (Nostalgic) v. Modern (Newfangled)

CDC Wren I, 1983
3600 RPM
0.03 GBytes capacity
Tracks/Inch: 800
Bits/Inch: 9550
Three 5.25-inch platters
Bandwidth: 0.6 MBytes/sec
Latency: 48.3 ms
Cache: none

Seagate 373453, 2003
15000 RPM (4X)
73.4 GBytes (2500X)
Tracks/Inch: 64000 (80X)
Bits/Inch: 533,000 (60X)
Four 2.5-inch platters (in 3.5-inch form factor)
Bandwidth: 86 MBytes/sec (140X)
Latency: 5.7 ms (8X)
Cache: 8 MBytes

Latency Lags Bandwidth (for last ~20 years)

[Figure: Performance Milestones. Log-log plot of relative bandwidth improvement (1 to 10000) against relative latency improvement (1 to 100), with a reference line where latency improvement = bandwidth improvement. Curve shown: Disk.]

Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention, BW = best-case)

Memory: Archaic (Nostalgic) v. Modern (Newfangled)

1980 DRAM (asynchronous)
0.06 Mbits/chip
64,000 xtors, 35 mm^2
16-bit data bus per module, 16 pins/chip
13 Mbytes/sec
Latency: 225 ns
(no block transfer)

2000 Double Data Rate Synchr. (clocked) DRAM
256.00 Mbits/chip (4000X)
256,000,000 xtors, 204 mm^2
64-bit data bus per DIMM, 66 pins/chip (4X)
1600 Mbytes/sec (120X)
Latency: 52 ns (4X)
Block transfers (page mode)

Latency Lags Bandwidth (for last ~20 years)

[Figure: Performance Milestones. Log-log plot of relative bandwidth improvement (1 to 10000) against relative latency improvement (1 to 100), with a reference line where latency improvement = bandwidth improvement. Curves shown: Memory, Disk.]

Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention, BW = best-case)

LANs: Archaic (Nostalgic) v. Modern (Newfangled)

Ethernet 802.3
Year of Standard: 1978
10 Mbits/s link speed
Latency: 3000 µsec
Shared media
Coaxial cable

Ethernet 802.3ae
Year of Standard: 2003
10,000 Mbits/s link speed (1000X)
Latency: 190 µsec (15X)
Switched media
Category 5 copper wire

Coaxial Cable: plastic covering, braided outer conductor, insulator, copper core
Twisted Pair: copper, 1mm thick, twisted to avoid antenna effect
"Cat 5" is 4 twisted pairs in bundle

Latency Lags Bandwidth (for last ~20 years)

[Figure: Performance Milestones. Log-log plot of relative bandwidth improvement (1 to 10000) against relative latency improvement (1 to 100), with a reference line where latency improvement = bandwidth improvement. Curves shown: Network, Memory, Disk.]

Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention, BW = best-case)

CPUs: Archaic (Nostalgic) v. Modern (Newfangled)

1982 Intel 80286
12.5 MHz
2 MIPS (peak)
Latency 320 ns
134,000 xtors, 47 mm^2
16-bit data bus, 68 pins
Microcode interpreter, separate FPU chip
(no caches)

2001 Intel Pentium 4
1500 MHz (120X)
4500 MIPS (peak) (2250X)
Latency 15 ns (20X)
42,000,000 xtors, 217 mm^2
64-bit data bus, 423 pins
3-way superscalar, Dynamic translate to RISC, Superpipelined (22 stage), Out-of-Order execution
On-chip 8KB Data caches, 96KB Instr. Trace cache, 256KB L2 cache

Latency Lags Bandwidth (for last ~20 years)

Performance Milestones
Processor: 286, 386, 486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

CPU high, Memory low (Memory Wall)

[Figure: Log-log plot of relative bandwidth improvement (1 to 10000) against relative latency improvement (1 to 100), with a reference line where latency improvement = bandwidth improvement. Curves shown: Processor, Network, Memory, Disk.]
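
The (latency, bandwidth) improvement pairs quoted for each technology can be recomputed from the archaic and modern figures on the comparison slides. The sketch below does that arithmetic; the dictionary layout and the name `improvements` are assumptions made for illustration, and the network latencies are taken to be in microseconds.

```python
# (bandwidth_old, bandwidth_new, latency_old, latency_new), units as labeled.
MILESTONES = {
    "Disk (MBytes/s, ms)": (0.6, 86.0, 48.3, 5.7),
    "Memory (MBytes/s, ns)": (13.0, 1600.0, 225.0, 52.0),
    "Network (Mbits/s, usec)": (10.0, 10000.0, 3000.0, 190.0),
    "Processor (MIPS, ns)": (2.0, 4500.0, 320.0, 15.0),
}


def improvements(bw_old, bw_new, lat_old, lat_new):
    """Return (bandwidth improvement, latency improvement) as ratios > 1."""
    return bw_new / bw_old, lat_old / lat_new


if __name__ == "__main__":
    for name, figures in MILESTONES.items():
        bw, lat = improvements(*figures)
        print(f"{name:26s} bandwidth x{bw:7.1f}   latency x{lat:5.1f}")
```

In every row the bandwidth ratio is far larger than the latency ratio, which is the point of the slide title.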


Define and quantify power (1/2)

For CMOS chips, traditional dominant energy consumption has been in switching transistors, called dynamic power

Power_dynamic = 1/2 × CapacitiveLoad × Voltage^2 × FrequencySwitched

For mobile devices, energy better metric

Energy_dynamic = CapacitiveLoad × Voltage^2

For a fixed task, slowing clock rate (frequency switched) reduces power, but not energy
Capacitive load a function of number of transistors connected to output and technology, which determines capacitance of wires and transistors
Dropping voltage helps both, so went from 5V to 1V
To save energy & dynamic power, most CPUs now turn off clock of inactive modules (e.g. Fl. Pt. Unit)
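
The claim that a slower clock cuts power but not energy for a fixed task follows directly from the dynamic power formula. The sketch below checks it numerically; the function names, the 1 nF load, and the cycle count are hypothetical values picked only to make the arithmetic visible.

```python
def dynamic_power(cap_load, voltage, freq_switched):
    """Dynamic power in watts: 1/2 * C * V^2 * f."""
    return 0.5 * cap_load * voltage ** 2 * freq_switched


def task_energy(cap_load, voltage, freq_switched, cycles):
    """Energy in joules for a task of `cycles` clock cycles: power * run time."""
    run_time = cycles / freq_switched
    return dynamic_power(cap_load, voltage, freq_switched) * run_time


if __name__ == "__main__":
    C, V, CYCLES = 1e-9, 1.0, 1e9  # assumed load, voltage, and task length
    for f in (2e9, 1e9):
        print(f"f = {f:.0e} Hz: power = {dynamic_power(C, V, f):.2f} W, "
              f"energy = {task_energy(C, V, f, CYCLES):.2f} J")
    # Voltage enters squared, so 5 V -> 1 V cuts both power and energy by 25x.
    print("5V vs 1V:", dynamic_power(C, 5.0, 1e9) / dynamic_power(C, 1.0, 1e9))
```

Halving the frequency halves the power but doubles the run time, so the energy for the task is unchanged; only the voltage reduction lowers both.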

Example of quantifying power

Suppose 15% reduction in voltage results in a 15% reduction in frequency. What is impact on dynamic power?

Power_dynamic = 1/2 × CapacitiveLoad × Voltage^2 × FrequencySwitched
= 1/2 × 0.85 × CapacitiveLoad × (0.85 × Voltage)^2 × FrequencySwitched
= (0.85)^3 × OldPower_dynamic
≈ 0.6 × OldPower_dynamic
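
The same scaling argument works for any pair of voltage and frequency reductions, since the capacitive load cancels out of the ratio. A minimal sketch follows; the helper name `dynamic_power_ratio` is an assumption introduced for this example.

```python
def dynamic_power_ratio(voltage_scale, frequency_scale):
    """New dynamic power divided by old dynamic power.

    Power is proportional to V^2 * f, so the capacitive load cancels out.
    """
    return voltage_scale ** 2 * frequency_scale


if __name__ == "__main__":
    # 15% lower voltage and 15% lower frequency, as in the example.
    ratio = dynamic_power_ratio(0.85, 0.85)
    print(f"new power = {ratio:.3f} x old power")
```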


Define and quantify power (2/2)

Because leakage current flows even when a transistor is off, now static power important too

Power_static = Current_static × Voltage

Leakage current increases in processors with smaller transistor sizes
Increasing the number of transistors increases power even if they are turned off
In 2006, goal for leakage is 25% of total power consumption; high performance designs at 40%
Very low power systems even gate voltage to inactive modules to control loss due to leakage


Define and quantify dependability (1/3)

How do we decide when a system is operating properly?
Infrastructure providers now offer Service Level Agreements (SLA) to guarantee that their networking or power service would be dependable
Systems alternate between 2 states of service with respect to an SLA:
1. Service accomplishment, where the service is delivered as specified in SLA
2. Service interruption, where the delivered service is different from the SLA
Failure = transition from state 1 to state 2
Restoration = transition from state 2 to state 1


Define and quantify dependability (2/3)

Module reliability = measure of continuous service accomplishment (or time to failure).
2 metrics
1. Mean Time To Failure (MTTF) measures Reliability
2. Failures In Time (FIT) = 1/MTTF, the rate of failures
Traditionally reported as failures per billion hours of operation
Mean Time To Repair (MTTR) measures Service Interruption
Mean Time Between Failures (MTBF) = MTTF + MTTR
Module availability measures service as alternate between the 2 states of accomplishment and interruption (number between 0 and 1, e.g. 0.9)
Module availability = MTTF / (MTTF + MTTR)
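
These definitions translate into a few one-line functions. The sketch below is illustrative only: the function names and the 24-hour repair time are assumptions, not values from the slides.

```python
def fit(mttf_hours):
    """Failures In Time: failures per billion (1e9) hours of operation."""
    return 1e9 / mttf_hours


def mtbf(mttf_hours, mttr_hours):
    """Mean Time Between Failures = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours, mttr_hours):
    """Fraction of time in the service-accomplishment state."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # A module with a 1M-hour MTTF and an assumed 24-hour repair time.
    print("FIT:", fit(1_000_000))
    print("MTBF (hours):", mtbf(1_000_000, 24))
    print(f"availability: {availability(1_000_000, 24):.6f}")
```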


Example calculating reliability

If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules
Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):

FailureRate = 10 × (1 / 1,000,000) + 1 / 500,000 + 1 / 200,000
= (10 + 2 + 5) / 1,000,000
= 17 / 1,000,000
= 17,000 FIT
MTTF = 1,000,000,000 / 17,000
≈ 59,000 hours
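
The calculation above can be reproduced in a few lines, which also makes it easy to try other configurations. This is a sketch under the same exponential-lifetime assumption; the variable and function names are illustrative.

```python
def system_failure_rate(mttfs_hours):
    """Failures per hour for a system that fails when any module fails,
    assuming exponentially distributed lifetimes (rates simply add)."""
    return sum(1.0 / m for m in mttfs_hours)


# 10 disks, 1 disk controller, 1 power supply (MTTF in hours).
components = [1_000_000] * 10 + [500_000, 200_000]

rate_per_hour = system_failure_rate(components)
system_fit = rate_per_hour * 1e9      # failures per billion hours
system_mttf = 1.0 / rate_per_hour     # hours

if __name__ == "__main__":
    print(f"FIT  = {system_fit:,.0f}")
    print(f"MTTF = {system_mttf:,.0f} hours")
```

The exact MTTF is a little under 59,000 hours; the slide rounds it.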

Definition: Performance
Performance is in units of things per sec
bigger is better
If we are primarily concerned with response time

performance(x) = 1 / execution_time(x)

"X is n times faster than Y" means

n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)


Performance: What to measure

Usually rely on benchmarks vs. real workloads
To increase predictability, collections of benchmark applications, called benchmark suites, are popular
SPECCPU: popular desktop benchmark suite
CPU only, split between integer and floating point programs
SPECint2000 has 12 integer, SPECfp2000 has 14 floating-point pgms
SPECCPU2006 to be announced Spring 2006
SPECSFS (NFS file server) and SPECWeb (WebServer) added as server benchmarks
Transaction Processing Performance Council (TPC) measures server performance and cost-performance for databases
TPC-C Complex query for Online Transaction Processing
TPC-H models ad hoc decision support
TPC-W a transactional web benchmark
TPC-App application server and web services benchmark


How Summarize Suite Performance (1/5)

Arithmetic average of execution time of all pgms?
But they vary by 4X in speed, so some would be more important than others in arithmetic average
Could add a weight per program, but how pick weight?
Different companies want different weights for their products
SPECRatio: Normalize execution times to reference computer, yielding a ratio proportional to performance =
time on reference computer / time on computer being rated


How Summarize Suite Performance (2/5)

If program SPECRatio on Computer A is 1.25 times bigger than Computer B, then

1.25 = SPECRatio_A / SPECRatio_B
= (ExecutionTime_reference / ExecutionTime_A) / (ExecutionTime_reference / ExecutionTime_B)
= ExecutionTime_B / ExecutionTime_A
= Performance_A / Performance_B

Note that when comparing 2 computers as a ratio, execution times on the reference computer drop out, so choice of reference computer is irrelevant

How Summarize Suite Performance (3/5)

Since ratios, proper mean is geometric mean
(SPECRatio unitless, so arithmetic mean meaningless)

GeometricMean = (SPECRatio_1 × SPECRatio_2 × ... × SPECRatio_n)^(1/n)

1. Geometric mean of the ratios is the same as the ratio of the geometric means
2. Ratio of geometric means = Geometric mean of performance ratios
=> choice of reference computer is irrelevant!

These two points make geometric mean of ratios attractive to summarize performance
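
The reference-independence property can be checked numerically. In the sketch below all execution times are made-up numbers and the names `spec_ratios`, `geometric_mean`, and `compare` are assumptions for illustration; the point is only that two different reference machines give the same A-versus-B comparison.

```python
import math


def spec_ratios(ref_times, times):
    """Per-program ratio: time on reference computer / time on rated computer."""
    return [r / t for r, t in zip(ref_times, times)]


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def compare(ref_times, times_a, times_b):
    """Geometric-mean SPECRatio of A divided by that of B."""
    return (geometric_mean(spec_ratios(ref_times, times_a))
            / geometric_mean(spec_ratios(ref_times, times_b)))


if __name__ == "__main__":
    times_a = [10.0, 20.0, 40.0]   # hypothetical seconds per program on A
    times_b = [20.0, 20.0, 20.0]   # hypothetical seconds per program on B
    ref_1 = [100.0, 200.0, 400.0]  # one hypothetical reference machine
    ref_2 = [50.0, 500.0, 80.0]    # a very different reference machine
    print(compare(ref_1, times_a, times_b))
    print(compare(ref_2, times_a, times_b))
```

Both calls print the same value (up to floating-point rounding), because the reference times cancel inside the ratio of geometric means.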


How Summarize Suite Performance (4/5)

Does a single mean well summarize performance of programs in benchmark suite?
Can decide if mean a good predictor by characterizing variability of distribution using standard deviation
Like geometric mean, geometric standard deviation is multiplicative rather than arithmetic
Can simply take the logarithm of SPECRatios, compute the standard mean and standard deviation, and then take the exponent to convert back:

GeometricMean = exp( (1/n) × sum over i = 1..n of ln(SPECRatio_i) )

GeometricStDev = exp( StDev( ln(SPECRatio_i) ) )


How Summarize Suite Performance (5/5)

Standard deviation is more informative if know distribution has a standard form
bell-shaped normal distribution, whose data are symmetric around mean
lognormal distribution, where logarithms of data--not data itself--are normally distributed (symmetric) on a logarithmic scale
For a lognormal distribution, we expect that
68% of samples fall in range [mean / gstdev, mean × gstdev]
95% of samples fall in range [mean / gstdev^2, mean × gstdev^2]
Note: Excel provides functions EXP(), LN(), and STDEV() that make calculating geometric mean and multiplicative standard deviation easy
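
The same log, mean, standard deviation, exponent recipe is easy to write outside a spreadsheet. The sketch below uses Python's `statistics.stdev`, which like Excel's STDEV() is the sample standard deviation; the three SPECRatio values are invented so the result can be checked by hand.

```python
import math
import statistics


def geometric_mean_and_stdev(ratios):
    """Return (geometric mean, geometric standard deviation) of positive ratios."""
    logs = [math.log(r) for r in ratios]
    gm = math.exp(statistics.mean(logs))
    gstdev = math.exp(statistics.stdev(logs))  # sample standard deviation
    return gm, gstdev


if __name__ == "__main__":
    ratios = [1000.0, 2000.0, 4000.0]  # hypothetical SPECRatios
    gm, gsd = geometric_mean_and_stdev(ratios)
    print(f"GM = {gm:.0f}, GSTDEV = {gsd:.2f}")
    # Range expected to hold about 68% of samples if the data are lognormal.
    print(f"one-deviation range: [{gm / gsd:.0f}, {gm * gsd:.0f}]")
```

Note that the range is formed by dividing and multiplying by the geometric standard deviation, not by subtracting and adding it.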


Example Standard Deviation (1/2)

GM and multiplicative StDev of SPECfp2000 for Itanium 2
GM = 2712
GSTDEV = 1.98

[Figure: bar chart of SPECfpRatio (0 to 14000) for wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi, with horizontal lines at 1372, 2712 (GM), and 5362.]

Example Standard Deviation (2/2)

GM and multiplicative StDev of SPECfp2000 for AMD Athlon
GM = 2086
GSTDEV = 1.40

[Figure: bar chart of SPECfpRatio (0 to 14000) for wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi, with horizontal lines at 1494, 2086 (GM), and 2911.]


1) Taking Advantage of Parallelism

Increasing throughput of server computer via multiple processors or multiple disks
Detailed HW design
Carry lookahead adders use parallelism to speed up computing sums from linear to logarithmic in number of bits per operand
Multiple memory banks searched in parallel in set-associative caches
Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence.
Not every instruction depends on immediate predecessor => executing instructions completely/partially in parallel possible
Classic 5-stage pipeline:
1) Instruction Fetch (Ifetch),
2) Register Read (Reg),
3) Execute (ALU),
4) Data Memory Access (Dmem),
5) Register Write (Reg)
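
The benefit of overlapping the five stages can be estimated with a simple cycle count. The sketch below assumes an ideal pipeline with one stage per clock cycle and no hazards or stalls; the function names are illustrative and not taken from the slides.

```python
STAGES = ["Ifetch", "Reg", "ALU", "Dmem", "Reg"]


def unpipelined_cycles(n):
    """Each instruction runs all stages before the next one starts."""
    return n * len(STAGES)


def pipelined_cycles(n):
    """Ideal pipeline: a new instruction starts every cycle, no stalls."""
    return len(STAGES) + n - 1 if n > 0 else 0


def schedule(n):
    """For each instruction, map clock cycle (1-based) to the stage it occupies."""
    return [{i + s + 1: name for s, name in enumerate(STAGES)} for i in range(n)]


if __name__ == "__main__":
    n = 4
    total = pipelined_cycles(n)
    for i, row in enumerate(schedule(n)):
        cells = [row.get(c, "").ljust(7) for c in range(1, total + 1)]
        print(f"instr {i + 1}: " + "".join(cells))
    print(f"{n} instructions: {unpipelined_cycles(n)} cycles unpipelined, "
          f"{total} cycles pipelined")
```

For long instruction sequences the ideal speedup approaches the number of stages; real pipelines fall short of this because of stalls.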


Pipelined Instruction Execution

[Figure: pipeline diagram. Time (clock cycles, Cycle 1 to Cycle 7) runs horizontally and instruction order vertically; four successive instructions each pass through Ifetch, Reg, ALU, DMem, Reg, each starting one cycle after the previous one.]

Limits to pipelining
Hazards prevent next instruction from executing during its designated clock cycle
Structural hazards: attempt to use the same hardware to do two different things at once
Data hazards: Instruction depends on result of prior instruction still in the pipeline
Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

[Figure: pipeline diagram. Time (clock cycles) runs horizontally and instruction order vertically; four overlapped instructions each pass through Ifetch, Reg, ALU, DMem, Reg.]

2) The Principle of Locality

The Principle of Locality:
Programs access a relatively small portion of the address space at any instant of time.

Two Different Types of Locality:
Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

Last 30 years, HW relied on locality for memory perf.
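
A toy cache model shows why hardware can lean on locality. The sketch below is a deliberately simplified, fully associative LRU cache; the cache size, block size, and access patterns are all assumptions chosen for illustration, not a model of any real machine.

```python
import random
from collections import OrderedDict


def hit_rate(addresses, num_blocks=64, block_size=16):
    """Fraction of accesses that hit in a small fully associative LRU cache."""
    cache = OrderedDict()  # block number -> True, ordered oldest to newest use
    hits = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > num_blocks:
                cache.popitem(last=False)  # evict least recently used block
    return hits / len(addresses)


if __name__ == "__main__":
    sequential = list(range(4096))       # spatial locality: array walk
    loop = list(range(256)) * 16         # temporal locality: repeated loop
    rng = random.Random(0)
    scattered = [rng.randrange(1 << 20) for _ in range(4096)]  # little locality
    for name, trace in [("sequential", sequential), ("loop", loop),
                        ("scattered", scattered)]:
        print(f"{name:10s} hit rate = {hit_rate(trace):.3f}")
```

The sequential walk hits on 15 of every 16 accesses purely from block fetches, the loop hits almost always after its first pass, and the scattered trace hardly ever hits.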


Levels of the Memory Hierarchy

Level: Capacity; Access Time; Cost
CPU Registers: 100s Bytes; 300-500 ps (0.3-0.5 ns)
L1 and L2 Cache: 10s-100s K Bytes; ~1 ns - ~10 ns; $1000s / GByte
Main Memory: G Bytes; 80ns-200ns; ~$100 / GByte
Disk: 10s T Bytes; 10 ms (10,000,000 ns); ~$1 / GByte
Tape: infinite; sec-min; ~$1 / GByte

Staging Xfer Unit between adjacent levels
Registers and L1 Cache: Instr. Operands, 1-8 bytes, managed by prog./compiler
L1 Cache and L2 Cache: Blocks, 32-64 bytes, managed by cache cntl
L2 Cache and Memory: Blocks, 64-128 bytes, managed by cache cntl
Memory and Disk: Pages, 4K-8K bytes, managed by OS
Disk and Tape: Files, Mbytes, managed by user/operator

Upper Level: faster
Lower Level: larger

3) Focus on the Common Case

Common sense guides computer design
Since it's engineering, common sense is valuable
In making a design trade-off, favor the frequent case over the infrequent case
E.g., Instruction fetch and decode unit used more frequently than multiplier, so optimize it 1st
E.g., If database server has 50 disks / processor, storage dependability dominates system dependability, so optimize it 1st
Frequent case is often simpler and can be done faster than the infrequent case
E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing more common case of no overflow
May slow down overflow, but overall performance improved by optimizing for the normal case
What is frequent case and how much performance improved by making case faster => Amdahl's Law


4) Amdahl's Law

ExTime_new = ExTime_old × [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Speedup_overall = ExTime_old / ExTime_new = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Best you could ever hope to do:

Speedup_maximum = 1 / (1 - Fraction_enhanced)


Amdahl's Law example

New CPU 10X faster
I/O bound server, so 60% time waiting for I/O

Speedup_overall = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
= 1 / [ (1 - 0.4) + 0.4 / 10 ]
= 1 / 0.64
= 1.56

Apparently, it's human nature to be attracted by 10X faster, vs. keeping in perspective it's just 1.6X faster
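
The example can be reproduced, and the bound explored, with two small functions. This is a sketch; the names `amdahl_speedup` and `max_speedup` are assumptions introduced here.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only `fraction_enhanced` of the time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


def max_speedup(fraction_enhanced):
    """Limit of the overall speedup as the enhancement becomes infinitely fast."""
    return 1.0 / (1.0 - fraction_enhanced)


if __name__ == "__main__":
    # 60% of the time is I/O wait, so only 40% benefits from the 10X CPU.
    print(f"overall speedup: {amdahl_speedup(0.4, 10):.4f}")
    print(f"upper bound:     {max_speedup(0.4):.4f}")
```

Even an infinitely fast CPU would give less than a 1.7X overall speedup here, because the I/O time is untouched.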


5) Processor performance equation

CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

Instructions / Program = inst count
Cycles / Instruction = CPI
Seconds / Cycle = cycle time

What affects each term:

                Inst Count    CPI    Clock Rate
Program             X
Compiler            X         (X)
Inst. Set.          X          X
Organization                   X         X
Technology                               X
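
The three factors multiply, so an improvement in any one of them scales CPU time directly. The sketch below uses invented numbers (1 billion instructions, 1 GHz clock) and a hypothetical function name to show the arithmetic.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instructions * cycles/instruction * seconds/cycle."""
    return instruction_count * cpi / clock_rate_hz


if __name__ == "__main__":
    base = cpu_time(1e9, 2.0, 1e9)        # assumed baseline machine
    better_cpi = cpu_time(1e9, 1.5, 1e9)  # same program, improved organization
    print(f"baseline:   {base:.2f} s")
    print(f"better CPI: {better_cpi:.2f} s ({base / better_cpi:.2f}X faster)")
```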


What's a Clock Cycle?

[Figure: latch or register feeding combinational logic]

Old days: 10 levels of gates
Today: determined by numerous time-of-flight issues + gate delays
clock propagation, wire lengths, drivers
