
ECE 252 / CPS 220
Advanced Computer Architecture I

Fall 2003
Duke University

Instructor: Prof. Daniel Sorin ([email protected])

based on slides developed by
Profs. Roth, Hill, Wood, Sohi, Smith, Lipasti, and Vijaykumar

© 2003 by Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

Administrivia

• addresses, email, website, etc.
• list of topics
• expected background
• course requirements
• grading and academic misconduct

Instructors and Course Website

instructor: Prof. Daniel Sorin ([email protected])
• office: 1111 Hudson Hall
• office hours: Weds 3-4, Thurs 10-11

TA: Phil Paik ([email protected])
• no office hours - send questions by email

website: http://www.ee.duke.edu/~sorin/ece252
• info on schedule, lectures, readings, etc.
• please become familiar with this website
• don't email me or Phil without first checking the website for the answer

Where to Get Answers

Consult course resources in this order:
• Course Website (http://www.ee.duke.edu/~sorin/ece252)
• Course Newsgroup (duke.courses.ece252)
• TA: Phil Paik ([email protected])
• Professor Sorin

Email to TA and Professor must have a subject line that begins with ECE252
• No other email will be answered!

What is This Course All About?

State-of-the-art computer hardware design

Topics
• Uniprocessor architecture (i.e., microprocessors)
• Memory architecture
• I/O architecture
• Brief look at multithreading and multiprocessors

Fundamentals, current systems, and future systems
• Will read from textbook, classic papers, brand-new papers

Course Goals and Expectations

Course goals
+ Understand how current processors work
+ Understand how to evaluate/compare processors
+ Learn how to use a simulator to perform experiments
+ Learn research skills by performing a term project

Course expectations
• Will loosely follow text
• Major emphasis on cutting-edge issues
• Students will read a list of research papers
• Term project

Expected Background

• Basic architecture (ECE 152 / CPS 104)
• Basic OS (ECE 153 / CPS 110)

Other useful and related courses:
• Digital system design (ECE 251)
• VLSI systems (ECE 261)
• Multiprocessor architecture (ECE 259 / CPS 221)
• Fault tolerant computing (ECE 254)
• Computer networks and systems (CPS 114 & 214)
• Programming languages & compilers (CS 106 & 206)
• Advanced OS (CPS 210)

Course Components

Reading Materials
• Computer Architecture: A Quantitative Approach by Hennessy and Patterson, 3rd Edition
• Readings in Computer Architecture by Hill, Jouppi, Sohi
• Recent research papers (online)

Homework
• 4 to 6 homework assignments

Project
• Groups of 2 or 3

Exams
• Midterm and final exam, in class

Grading

Grading breakdown
• Homework: 30%
• Midterm: 15%
• Project: 25%
• Final: 30%

Late policy
• Late homeworks
  • late <1 day = 50% off
  • late >1 day = zero
• No late term project will be accepted. Period.

Conduct

Lectures
• Consult course website for tentative schedule

Academic Misconduct
• University policy will be followed strictly
• Zero tolerance for cheating and/or plagiarism

Now, moving on to computer architecture ...


Question

What do these two time periods have in common?
• big bang–2001
• 2001–2003

What is Computer Architecture?

"The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the dataflow and controls, the logic design, and the physical implementation."
–Gene Amdahl, IBM Journal of R&D, Apr 1964

Architecture and Other Disciplines

(software/hardware stack, top to bottom)
Application Software
Operating Systems, Compilers, Networking Software
Computer Architecture
Circuits, Wires, Devices, Network Hardware

architecture interacts with many other fields
• can't be studied in a vacuum

Levels of Computer Architecture

architecture
• functional appearance to immediate user
• opcodes, addressing modes, architected registers

implementation (microarchitecture)
• logical structure that performs the architecture
• pipelining, functional units, caches, physical registers

realization (circuits)
• physical structure that embodies the implementation
• gates, cells, transistors, wires


Role of the Computer Microarchitect

architect: defines the hardware/software interface

microarchitect: defines the hardware implementation
• usually the same person

decisions based on
• applications
• performance
• cost
• reliability
• power . . .

Applications -> Requirements -> Designs

• scientific: weather prediction, molecular modeling
  • need: large memory, floating-point arithmetic
  • examples: CRAY-1, T3E, IBM DeepBlue, BlueGene
• commercial: inventory, payroll, web serving, e-commerce
  • need: integer arithmetic, high I/O
  • examples: SUN SPARCcenter, Enterprise
• desktop: multimedia, games, entertainment
  • need: high data bandwidth, graphics
  • examples: Intel Pentium4, IBM Power4, Motorola PPC 620
• mobile: laptops
  • need: low power (battery), good performance
  • examples: Intel Mobile Pentium III, Transmeta TM5400
• embedded: cell phones, automobile engines, door knobs
  • need: low power (battery + heat), low cost
  • examples: Compaq/Intel StrongARM, X-Scale, Transmeta TM3200
Why Study Computer Architecture?

answer #1: requirements are always changing

aren't computers fast enough already?
• are they?
• fast enough to do everything we will EVER want?
  • AI, VR, protein sequencing, ????

is speed the only goal?
• power: heat dissipation + battery life
• cost
• reliability
• etc.

Why Study Computer Architecture?

answer #2: technology playing field is always changing

• annual technology improvements (approximate)
  • SRAM (logic): density +25%, speed +20%
  • DRAM (memory): density +60%, speed +4%
  • disk (magnetic): density +25%, speed +4%
  • fiber: ??
• parameters change and change relative to one another!

designs change even if requirements are fixed
but requirements are not fixed


Examples of Changing Designs

example I: caches
• 1970: 10K transistors, DRAM faster than logic -> bad idea
• 1990: 1M transistors, logic faster than DRAM -> good idea
• will caches ever be a bad idea again?

example II: out-of-order execution
• 1985: 100K transistors + no precise interrupts -> bad idea
• 1995: 2M transistors + precise interrupts -> good idea
• 2005: 100M transistors + 10GHz clock -> bad idea?

semiconductor technology is an incredible driving force

Moore's Law

"Cramming More Components onto Integrated Circuits"
–G.E. Moore, Electronics, 1965

• observation: (DRAM) transistor density doubles annually
• became known as "Moore's Law"
• wrong - density doubles every 18 months (had only 4 data points)
• corollaries
  • cost / transistor halves annually (every 18 months)
  • power per transistor decreases with scaling
  • speed increases with scaling
  • reliability increases with scaling (depends how small!)

Moore's Law

"performance doubles every 18 months"
• common interpretation of Moore's Law, not original intent
• wrong! "performance" doubles every ~2 years
• self-fulfilling prophecy (Moore's Curve)
  • 2X every 2 years = ~3% increase per month
  • 3% per month used to judge performance features
  • if a feature adds 9 months to schedule...
  • ...it should add at least 30% to performance (1.03^9 ≈ 1.30 → 30%, checked in the sketch below)
  • Itanium: under Moore's Curve in a big way

Q: what do (big bang–2001 and 2001–2003) have in common?
A: same absolute increase in computing power

Evolution of Single-Chip Microprocessors

                    1971–1980    1981–1990    1991–2000    2010
Transistor Count    10K–100K     100K–1M      1M–100M      1B
Clock Frequency     0.2–2MHz     2–20MHz      20MHz–1GHz   10GHz
IPC                 < 0.1        0.1–0.9      0.9–2.0      10 (?)
MIPS/MFLOPS         < 0.2        0.2–20       20–2,000     100,000

some perspective: 1971–2001 performance improved 35,000X!!!
• what if cars improved at this rate?
  • 1971: 60 MPH / 10 MPG
  • 2001: 2,100,000 MPH / 350,000 MPG
• but... what if cars crashed as often as computers did?
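A quick numeric check of the ~3%-per-month rule of thumb above, as a minimal Python sketch:

```python
# 2x performance every 2 years, compounded monthly
monthly_growth = 2 ** (1 / 24)
print(round(monthly_growth - 1, 3))   # ~0.029, i.e., roughly 3% per month

# a feature that costs 9 months of schedule should buy >= ~30% performance
print(round(1.03 ** 9, 2))            # ~1.3
```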


Performance and Cost

• performance metrics
• CPU performance equation
• benchmarks and benchmarking
• reporting averages
• Amdahl's Law
• Little's Law
• concepts
  • balance
  • tradeoffs
  • bursty behavior (average and peak performance)
• cost (mostly read on your own)

Readings

Hennessy & Patterson
• Chapter 1

Performance Metrics

latency: response time, execution time
• good metric for fixed amount of work (minimize time)

throughput: bandwidth, work per time, "performance"
• = (1 / latency) when there is NO OVERLAP
• > (1 / latency) when there is overlap
  • in real processors there is always overlap (e.g., pipelining)
• good metric for fixed amount of time (maximize work)

comparing performance
• A is N times faster than B iff
  • perf(A)/perf(B) = time(B)/time(A) = N
• A is X% faster than B iff
  • perf(A)/perf(B) = time(B)/time(A) = 1 + X/100

Performance Metric I: MIPS

MIPS (millions of instructions per second)
• (instruction count / execution time in seconds) x 10^-6
– instruction count is not a reliable indicator of work
  • some optimizations add instructions
  • work per instruction varies (FP mult >> register move)
  • instruction sets are not equal (3 Pentium instrs != 3 Alpha instrs)
– may vary inversely with actual performance

relative MIPS
• (time_reference / time_new) x MIPS_reference
+ a little better than native MIPS
– but very sensitive to reference machine
• upshot: may be useful if same ISA/compiler/OS/workload
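A minimal Python sketch of these definitions; the execution times and instruction count below are hypothetical, chosen only for illustration:

```python
def times_faster(time_a, time_b):
    """A is N times faster than B: N = time(B) / time(A)."""
    return time_b / time_a

def mips(instruction_count, exec_time_s):
    """Native MIPS = (instruction count / execution time) * 10^-6."""
    return (instruction_count / exec_time_s) * 1e-6

def relative_mips(time_ref, time_new, mips_ref):
    """Relative MIPS = (time_reference / time_new) * MIPS_reference."""
    return (time_ref / time_new) * mips_ref

time_a, time_b = 10.0, 25.0          # hypothetical seconds on machines A and B
print(times_faster(time_a, time_b))  # 2.5 -> A is 2.5 times faster (150% faster)

# same program, 5 billion dynamic instructions on both machines (hypothetical)
print(mips(5e9, time_a))             # 500.0 native MIPS on A
print(mips(5e9, time_b))             # 200.0 native MIPS on B
```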


Performance Metric II: MFLOPS

MFLOPS (millions of floating-point operations per second)
• (FP ops / execution time) x 10^-6
• like MIPS, but counts only FP operations
  • FP ops can't be optimized away (problem #1 from MIPS)
  • FP ops have longest latencies anyway (problem #2)
  • FP ops are the same across machines (problem #3)
– may have been valid in 1980 (most programs were FP)
  • most programs today are "integer", i.e., light on FP (#1)
  • load from memory takes longer than FP divide (#2)
  • Cray doesn't implement divide, Motorola has SQRT, SIN, COS (#3)

normalized MFLOPS: tries to deal with problem #3
• (canonical FP ops / time) x 10^-6
– assign a canonical # FP ops to a HLL program

CPU Performance Equation

processor performance = seconds / program
• separate into three components

seconds / program = (instructions / program) x (cycles / instruction) x (seconds / cycle)

• instructions / program: architecture (ISA), compiler-designer, CPS 206
• cycles / instruction: implementation (microarchitecture), processor-designer, ECE 252 / CPS 220
• seconds / cycle: realization (physical layout), circuit-designer, ECE 261

CPU Performance Equation

instructions / program: dynamic instruction count
• mostly determined by program, compiler, ISA

cycles / instruction: CPI
• mostly determined by ISA and CPU/memory organization

seconds / cycle: cycle time, clock time, 1 / clock frequency
• mostly determined by technology and CPU organization

uses of CPU performance equation
• high-level performance comparisons
• back of the envelope calculations
• helping architects think about compilers and technology

CPU Performance Comparison

famous example: "RISC Wars" (RISC vs. CISC)
• assume
  • instructions / program: CISC = P, RISC = 2P
  • CPI: CISC = 8, RISC = 2
• CISC time = P x 8 x T = 8PT
• RISC time = 2P x 2 x T = 4PT
• RISC time = CISC time / 2

the truth is much, much, much more complex
• actual data from IBM AS/400 (CISC -> RISC in 1995):
  • CISC (IMPI) time = P x 7 x T = 7PT
  • RISC (PPC) time = 3.1P x 3 x T/3.1 = 3PT (+1 tech. gen.)
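The iron law and the RISC-vs-CISC comparison above as a small Python sketch; P and T are symbolic on the slide, so placeholder values are assumed here:

```python
# CPU time = (instructions / program) * (cycles / instruction) * (seconds / cycle)
def cpu_time(insns, cpi, cycle_time):
    return insns * cpi * cycle_time

P = 1e9      # dynamic instruction count of the CISC version (assumed value)
T = 1e-9     # cycle time in seconds (assumed value; 1 GHz clock)

cisc = cpu_time(insns=P,     cpi=8, cycle_time=T)   # 8PT
risc = cpu_time(insns=2 * P, cpi=2, cycle_time=T)   # 4PT
print(cisc / risc)   # 2.0 -> RISC time is half the CISC time, under these assumptions
```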


CPU Back-of-the-Envelope Calculation

base machine
• 43% ALU ops (1 cycle), 21% loads (1 cycle)
• 12% stores (2 cycles), 24% branches (2 cycles)
• note: pretending latency is 1 because of pipeline

Q: should 1-cycle stores be implemented if they slow the clock by 15%?
• old CPI = 0.43 + 0.21 + (0.12 x 2) + (0.24 x 2) = 1.36
• new CPI = 0.43 + 0.21 + 0.12 + (0.24 x 2) = 1.24
• speedup = (P x 1.36 x T) / (P x 1.24 x 1.15T) = 0.95
• Answer: NO! (see the sketch below)

Actually Measuring Performance

how are execution time & CPI actually measured?
• execution time: time (Unix cmd): wall-clock, CPU, system
• CPI = (CPU time x clock frequency) / # instructions
• more useful? CPI breakdown (compute, memory stall, etc.)
  • so we know what the performance problems are (what to fix)

measuring CPI breakdown
• hardware event counters (PentiumPro, Alpha DCPI)
  • calculate CPI using instruction frequencies/event costs
• cycle-level microarchitecture simulator (e.g., SimpleScalar)
  + measure exactly what you want
  – model microarchitecture faithfully (at least parts of interest)
  • method of choice for many architects (yours, too!)
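The store-latency question above, redone as a minimal Python sketch (the instruction mix and the 15% clock penalty are taken from the slide):

```python
# Should stores be made 1 cycle if that slows the clock by 15%?
mix = {"alu": 0.43, "load": 0.21, "store": 0.12, "branch": 0.24}

old_cpi = mix["alu"]*1 + mix["load"]*1 + mix["store"]*2 + mix["branch"]*2   # 1.36
new_cpi = mix["alu"]*1 + mix["load"]*1 + mix["store"]*1 + mix["branch"]*2   # 1.24

# same instruction count P; new cycle time is 1.15 * T
speedup = (old_cpi * 1.0) / (new_cpi * 1.15)
print(round(old_cpi, 2), round(new_cpi, 2), round(speedup, 2))   # 1.36 1.24 0.95
```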

Benchmarks and Benchmarking

"program" as unit of work
• millions of them, many different kinds, which to use?

benchmarks
• standard programs for measuring/comparing performance
+ represent programs people care about
+ repeatable!!

benchmarking process
• define workload
• extract benchmarks from workload
• execute benchmarks on candidate machines
• project performance on new machine
• run workload on new machine and compare
• not close enough -> repeat

Benchmarks: Instruction Mixes

instruction mix: instruction type frequencies
– ignores dependences
+ ok for non-pipelined, scalar processor without caches
  • the way all processors used to be
• example: Gibson Mix - developed in 1950's at IBM
  • load/store: 31%, branches: 17%
  • compare: 4%, shift: 4%, logical: 2%
  • fixed add/sub: 6%, float add/sub: 7%
  • float mult: 4%, float div: 2%, fixed mul: 1%, fixed div: <1%
• qualitatively, these numbers are still useful today!


Benchmarks: Toys, Kernels, Synthetics

toy benchmarks: little programs that no one really runs
• e.g., fibonacci, 8 queens
– little value, what real programs do these represent?
  • scary fact: used to prove the value of RISC in early 80's

kernels: important (frequently executed) pieces of real programs
• e.g., Livermore loops, Linpack (inner product)
+ good for focusing on individual features, not big picture
– over-emphasize target feature (for better or worse)

synthetic benchmarks: programs made up for benchmarking
• e.g., Whetstone, Dhrystone
• toy kernels++, which programs do these represent?

Benchmarks: Real Programs

real programs
+ only accurate way to characterize performance
– requires considerable work (porting)

Standard Performance Evaluation Corporation (SPEC)
• http://www.spec.org
• collects, standardizes and distributes benchmark suites
• consortium made up of industry leaders
• ?!#$: program only included if it makes enough members look good
• SPEC CPU (CPU intensive benchmarks)
  • SPEC89, SPEC92, SPEC95, SPEC2000 (consortium at work)
• other benchmark suites
  • SPECjvm, SPECmail, SPECweb

SPEC CPU2000

12 integer programs (C, C++)
• gcc (compiler), perl (interpreter), vortex (database)
• bzip2, gzip (replace compress), crafty (chess, replaces go)
• eon (rendering), gap (group theoretic enumerations)
• twolf, vpr (FPGA place and route)
• parser (grammar checker), mcf (network optimization)

14 floating point programs (C, FORTRAN)
• swim (shallow water model), mgrid (multigrid field solver)
• applu (partial diffeq's), apsi (air pollution simulation)
• wupwise (quantum chromodynamics), mesa (OpenGL library)
• art (neural network image recognition), equake (wave propagation)
• fma3d (crash simulation), sixtrack (accelerator design)
• lucas (primality testing), galgel (fluid dynamics), ammp (chemistry)

Benchmarking Pitfalls

• benchmark properties mismatch with features studied
  • e.g., using SPEC for large cache studies
• careless scaling
  • using only first few million instructions (initialization phase)
  • reducing program data size
• choosing performance from wrong application space
  • e.g., in a realtime environment, choosing troff
  • others: SPECweb, TPC-W (amazon.com)
• using old benchmarks
• "benchmark specials": benchmark-specific optimizations

benchmarks must be continuously maintained and updated!


Reporting Average Performance

averages: one of the things architects frequently get wrong
+ pay attention now and you won't get them wrong on exams

important things about averages (i.e., means)
• ideally proportional to execution time (ultimate metric)
  • AM for times
  • HM for rates (IPCs)
  • GM for ratios (speedups), time proportionality impossible
• there is no such thing as the average program
  • use average when absolutely necessary

what if programs run at different frequencies within workload?
• "weighting"
• weighted AM = ∑_{1..N} w(i) * time(i), with weights w(i) summing to 1

What Does The Mean Mean?

arithmetic mean (AM): average execution times of N programs
• AM = ∑_{1..N} time(i) / N

harmonic mean (HM): average IPCs of N programs
• arithmetic mean cannot be used for rates (like IPCs)
  • 30 MPH for 1 mile + 90 MPH for 1 mile != avg. 60 MPH
• HM = N / ∑_{1..N} (1 / rate(i))

geometric mean (GM): average speedups of N programs
• GM = (∏_{1..N} speedup(i))^(1/N)
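A small Python sketch of the three means; the sample values below are made up for illustration:

```python
from math import prod

times    = [2.0, 4.0, 9.0]     # execution times (s) of N programs (hypothetical)
ipcs     = [0.5, 1.0, 2.0]     # rates (hypothetical)
speedups = [10.0, 0.1]         # ratios (hypothetical)

am = sum(times) / len(times)                    # arithmetic mean: use for times
hm = len(ipcs) / sum(1.0 / r for r in ipcs)     # harmonic mean: use for rates
gm = prod(speedups) ** (1.0 / len(speedups))    # geometric mean: use for ratios

print(am, round(hm, 3), gm)    # 5.0 0.857 1.0
```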

GM Weirdness

what about averaging ratios (speedups)?
• HM / AM change depending on which machine is the base

            machine A   machine B   B/A     A/B
Program1        1          10       10      0.1
Program2     1000         100        0.1   10

AM of ratios:   (10 + 0.1)/2 = 5.05         (0.1 + 10)/2 = 5.05
HM of ratios:   2/(1/10 + 1/0.1) ≈ 0.198    2/(1/0.1 + 1/10) ≈ 0.198
GM of ratios:   √(10 * 0.1) = 1             √(0.1 * 10) = 1

• with AM and HM, which machine looks "5.05 times faster" flips with the choice of base
• GM gives the same answer regardless of base, but...
– geometric mean of ratios is not proportional to total time!
  • if we take total execution time, B is 9.1 times faster
  • GM says they are equal

Amdahl's Law

"Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capabilities"
–G. Amdahl, AFIPS, 1967

• let optimization speed up fraction f of program by factor s
• speedup = old / ([(1-f) x old] + [(f/s) x old]) = 1 / (1 - f + f/s)

• f = 95%, s = 1.1 → 1/[(1-0.95) + (0.95/1.1)] = 1.094
• f = 5%, s = 10 → 1/[(1-0.05) + (0.05/10)] = 1.047
• f = 5%, s = ∞ → 1/[(1-0.05) + (0.05/∞)] = 1.052
• f = 95%, s = ∞ → 1/[(1-0.95) + (0.95/∞)] = 20

make common case fast, but...
...uncommon case eventually limits performance
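Amdahl's Law from the slide, checked numerically in a short Python sketch:

```python
def amdahl_speedup(f, s):
    """Speed up fraction f of execution time by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

print(round(amdahl_speedup(0.95, 1.1), 3))            # 1.094
print(round(amdahl_speedup(0.05, 10), 3))             # 1.047
print(round(amdahl_speedup(0.05, float("inf")), 3))   # ~1.053
print(round(amdahl_speedup(0.95, float("inf")), 1))   # 20.0
```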


Little's Law

Key relationship between latency and bandwidth:
Average number in system = arrival rate * average holding time

Example:
• How big a wine cellar should I build?
• We drink (and buy) an average of 4 bottles per week
• On average, I want to age my wine 5 years
• bottles in cellar = 4 bottles/week * 52 weeks/year * 5 years = 1040 bottles (see the sketch below)

System Balance

each system component produces & consumes data
• make sure data supply and demand is balanced
• X demand >= X supply ⇒ computation is "X-bound"
  • e.g., memory-bound, CPU-bound, I/O-bound
• goal: be bound everywhere at once (why?)
• X can be bandwidth or latency
  • X is bandwidth ⇒ buy more bandwidth
  • X is latency ⇒ much tougher problem
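Little's Law and the wine-cellar example above, as a minimal Python sketch:

```python
# Little's Law: average number in system = arrival rate * average time in system
def littles_law(arrival_rate, holding_time):
    return arrival_rate * holding_time

# wine cellar: 4 bottles/week arriving, each held 5 years (5 * 52 weeks)
print(littles_law(arrival_rate=4, holding_time=5 * 52))   # 1040 bottles
```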

Tradeoffs

"Bandwidth problems can be solved with money. Latency problems are harder, because the speed of light is fixed and you can't bribe God." –someone famous (John Cocke?)

well...
• can convert some latency problems to bandwidth problems
• solve those with money
• the famous "bandwidth/latency tradeoff"

• architecture is the art of making tradeoffs

Bursty Behavior

Q: to sustain 2 IPC... how many instructions should the processor be able to
• fetch per cycle?
• execute per cycle?
• complete per cycle?

A: NOT 2 (more than 2)
• dependences will cause stalls (under-utilization)
• if desired performance is X, peak performance must be > X

programs don't always obey "average" behavior
• can't design processor only to handle average behavior


Cost

very important to real designs
• startup cost
  • one large investment per chip (or family of chips)
  • increases with time
• unit cost
  • cost to produce individual copies
  • decreases with time
• only loose correlation to price and profit

Moore's corollary: price of high-performance system is constant
• performance doubles every 18 months
• cost per function (unit cost) halves every 18 months
• assumes startup costs are constant (they aren't)

Startup and Unit Costs

startup cost: manufacturing
• fabrication plant, clean rooms, lithography, etc. (~$3B)
• chip testers/debuggers (~$5M a piece, typically ~200)
• few companies can play this game (Intel, IBM, Sun)
• equipment more expensive as devices shrink

startup cost: research and development
• 300–500 person-years, mostly spent in verification
• need more people as designs become more complex

unit cost: manufacturing
• raw materials, chemicals, process time ($2–5K per wafer)
• decreased by improved technology & experience
Unit Cost and Die Size (Chip Area)

unit cost most strongly influenced by physical size of chip (die)
• semiconductors built on silicon wafers (8")
  • chemical + photolithographic steps create transistor/wire layers
  • typical number of metal layers (M) today is 6 (α ≈ 4)
• cost per wafer is roughly constant: C0 + C1 * α (~$5000)
• basic cost per chip proportional to chip area (mm^2)
  • typical: 150–200mm^2; ranges from 50mm^2 (embedded) to 300mm^2 (Itanium)
  • typical: 300–600 dies per wafer
• yield (% working chips) inversely proportional to area and α
  • non-zero defect density (manufacturing defects per unit area)
  • P(working chip) = (1 + (defect density * die area) / α)^–α
  • typical defect density: 0.005 per mm^2
  • typical yield: (1 + (0.005 * 200) / 4)^–4 = 40%
• typical cost per chip: $5000 / (500 * 40%) = $25

Unit Cost → Price

if a chip costs $25 to manufacture, why does it cost $500 to buy?
• integrated circuit costs $25
• must still be tested, packaged, and tested again
  • testing (time == $): $5 per working chip
  • packaging (ceramic + pins): $30
    • more expensive for more pins or if chip dissipates a lot of heat
    • packaging yield < 100% (but high)
  • post-packaging test: another $5
• total for packaged chip: ~$65
• spread startup cost over volume ($100–200 per chip)
  • proliferations (i.e., shrinks) are startup free (help profits)
• Intel needs to make a profit...
• ... and so does Dell
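The yield and cost-per-chip arithmetic above, checked in a short Python sketch (defect density, die area, α, wafer cost, and dies per wafer are the slide's numbers):

```python
# P(working chip) = (1 + defect_density * die_area / alpha) ** (-alpha)
def die_yield(defect_density, die_area, alpha):
    return (1.0 + defect_density * die_area / alpha) ** (-alpha)

def cost_per_good_die(wafer_cost, dies_per_wafer, yield_fraction):
    return wafer_cost / (dies_per_wafer * yield_fraction)

y = die_yield(defect_density=0.005, die_area=200, alpha=4)
print(round(y, 2))                                     # ~0.41, i.e., ~40% yield
print(round(cost_per_good_die(5000, 500, 0.40), 2))    # 25.0 dollars per good die
```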

Reading

H+P
• Chapter 1

next up: instruction set design (H+P, Chapter 2)

Summary: Performance

